QUICK REVIEW

[Paper Review] Would Mega-scale Datasets Further Enhance Spatiotemporal 3D CNNs?

Hirokatsu Kataoka, Tenga Wakamiya|arXiv (Cornell University)|Apr 10, 2020

Human Pose and Action Recognition | References: 31 | Cited by: 82

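ONE-SENTENCE SUMMARY

This paper studies how pre-training on large-scale, carefully annotated video datasets, and on merged datasets, affects the transfer-learning performance of spatiotemporal 3D CNNs. It reports gains on standard benchmarks and shows that very deep models (200 layers) only train successfully with the larger pre-training sets.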

ABSTRACT

How can we collect and use a video dataset to further improve spatiotemporal 3D Convolutional Neural Networks (3D CNNs)? In order to positively answer this open question in video recognition, we have conducted an exploration study using a couple of large-scale video datasets and 3D CNNs. In the early era of deep neural networks, 2D CNNs have been better than 3D CNNs in the context of video recognition. Recent studies revealed that 3D CNNs can outperform 2D CNNs trained on a large-scale video dataset. However, we heavily rely on architecture exploration instead of dataset consideration. Therefore, in the present paper, we conduct exploration study in order to improve spatiotemporal 3D CNNs as follows: (i) Recently proposed large-scale video datasets help improve spatiotemporal 3D CNNs in terms of video classification accuracy. We reveal that a carefully annotated dataset (e.g., Kinetics-700) effectively pre-trains a video representation for a video classification task. (ii) We confirm the relationships between #category/#instance and video classification accuracy. The results show that #category should initially be fixed, and then #instance is increased on a video dataset in case of dataset construction. (iii) In order to practically extend a video dataset, we simply concatenate publicly available datasets, such as Kinetics-700 and Moments in Time (MiT) datasets. Compared with Kinetics-700 pre-training, we further enhance spatiotemporal 3D CNNs with the merged dataset, e.g., +0.9, +3.4, and +1.1 on UCF-101, HMDB-51, and ActivityNet datasets, respectively, in terms of fine-tuning. (iv) In terms of recognition architecture, the Kinetics-700 and merged dataset pre-trained models increase the recognition performance to 200 layers with the Residual Network (ResNet), while the Kinetics-400 pre-trained model cannot successfully optimize the 200-layer architecture.

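MOTIVATION AND OBJECTIVES

Evaluate which large-scale pre-training datasets transfer best when fine-tuning on standard video benchmarks.
Examine how the number of categories and the number of instances in pre-training affect performance.
Test simple dataset merging as a way to increase the amount of pre-training data, and measure its effect.
Explore how increasing model depth (number of layers) affects 3D CNNs under different pre-training regimes.
Compare 3D-ResNet and (2+1)D architectures under large-scale pre-training.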

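PROPOSED METHOD

Pre-train 3D-ResNet variants on the Kinetics-700, MiT, STAIR, and Mini-HVU datasets.
Fine-tune on UCF-101, HMDB-51, and ActivityNet to measure transfer performance.
Systematically vary #category and #instance to study how data volume affects accuracy.
Create merged pre-training datasets (e.g., K+M, K+M+S) and compare them with single-dataset pre-training.
Evaluate model depth (ResNet-18 through ResNet-200) and compare 3D-ResNet with (2+1)D variants.
Compare results with and without an optical-flow stream (note: this work focuses on 3D CNNs with single-stream input).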
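
The dataset-merging step above can be sketched as a plain concatenation in which each dataset keeps its own categories and later labels are shifted by an offset. This is a minimal illustration and not the authors' code: the `merge_datasets` helper, the `(path, label)` sample format, and the file names are hypothetical, and the sketch assumes the label spaces are treated as disjoint (no attempt to unify overlapping category names). The class counts used here are the published sizes of Kinetics-700 (700) and Moments in Time (339).

```python
from typing import List, Tuple

Sample = Tuple[str, int]  # (video path, integer class label)


def merge_datasets(
    datasets: List[Tuple[List[Sample], int]],
) -> Tuple[List[Sample], int]:
    """Concatenate datasets, shifting each label by the classes seen so far.

    Each input is (samples, num_classes). Label spaces are assumed disjoint.
    Returns the merged sample list and the total number of classes.
    """
    merged: List[Sample] = []
    offset = 0
    for samples, num_classes in datasets:
        for path, label in samples:
            if not 0 <= label < num_classes:
                raise ValueError(f"label {label} out of range for {path}")
            merged.append((path, label + offset))
        offset += num_classes
    return merged, offset


# Toy stand-ins for the real datasets; file names are hypothetical.
kinetics = ([("k_0001.mp4", 0), ("k_0002.mp4", 699)], 700)
mit = ([("m_0001.mp4", 0), ("m_0002.mp4", 338)], 339)

merged, total_classes = merge_datasets([kinetics, mit])
print(total_classes)  # 1039
print(merged[2])      # ('m_0001.mp4', 700)
```

With this layout the classifier head for the merged K+M set would need one output per combined category, which is the main practical change compared with single-dataset pre-training.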

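EXPERIMENTAL RESULTS

Research Questions

RQ1: Which pre-training datasets transfer best to standard video recognition benchmarks for 3D CNNs?
RQ2: How do the number of categories and the number of instances in pre-training affect transfer accuracy?
RQ3: Does simply merging public video datasets into a larger pre-training set improve fine-tuning performance?
RQ4: How does increasing model depth affect transfer performance under different pre-training conditions?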
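
One way to set up the kind of study RQ2 asks about is to subsample a dataset to a chosen number of categories and a chosen number of instances per category. The sketch below is an assumption about how such subsets could be built, since this review does not describe the paper's sampling protocol; the `subsample` function and the toy data are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, int]  # (video path, integer class label)


def subsample(
    samples: List[Sample],
    num_categories: int,
    instances_per_category: int,
    seed: int = 0,
) -> List[Sample]:
    """Keep `num_categories` random classes and up to
    `instances_per_category` random videos from each, relabelled 0..C-1."""
    rng = random.Random(seed)
    by_label: Dict[int, List[str]] = defaultdict(list)
    for path, label in samples:
        by_label[label].append(path)

    labels = sorted(by_label)
    if num_categories > len(labels):
        raise ValueError("not enough categories in the source dataset")

    kept = sorted(rng.sample(labels, num_categories))
    subset: List[Sample] = []
    for new_label, old_label in enumerate(kept):
        paths = by_label[old_label]
        k = min(instances_per_category, len(paths))
        for path in rng.sample(paths, k):
            subset.append((path, new_label))
    return subset


# Toy dataset: 10 categories with 20 videos each (hypothetical names).
toy = [(f"c{c}_v{i}.mp4", c) for c in range(10) for i in range(20)]
subset = subsample(toy, num_categories=4, instances_per_category=5)
print(len(subset))                             # 20
print(sorted({label for _, label in subset}))  # [0, 1, 2, 3]
```

Fixing `num_categories` and sweeping `instances_per_category` (or the reverse) gives the two axes along which transfer accuracy can then be compared.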

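Key Findings

Among single-dataset pre-training options, Kinetics-700 gives the best transfer performance (top-1 video-level accuracy) on UCF-101, HMDB-51, and ActivityNet.
Merging Kinetics-700 with MiT (K+M) further improves fine-tuning results: relative to the Kinetics-700 baseline, +0.9 on UCF-101, +3.4 on HMDB-51, and +1.1 on ActivityNet.
Deeper 3D-ResNets (e.g., ResNet-200) benefit from Kinetics-700 and K+M pre-training, reaching higher accuracy on UCF-101, HMDB-51, and ActivityNet, whereas Kinetics-400 pre-training does not reliably improve very deep models.
RGB-only 3D CNNs (and their (2+1)D counterparts) transfer better when pre-trained on larger, well-annotated datasets; simply increasing the amount of data does not always help, possibly because of domain mismatch.
The result tables show concrete gains from the choice of pre-training data. R3D-50 with Kinetics-700: UCF-101 92.0, HMDB-51 66.0, ActivityNet 75.9; R3D-50 with K+M: 92.9, 69.4, 77.0; R(2+1)D-50 with Kinetics-700: 93.4, 69.4, 78.4.
In this setting, Kinetics-700 generally outperforms other single datasets such as MiT or STAIR.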
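
As a quick consistency check, the K+M gains quoted above can be recomputed from the R3D-50 accuracies listed in the findings. The snippet uses only the numbers given in this review; the variable names are illustrative.

```python
# R3D-50 top-1 accuracies (%) as quoted in the findings above.
kinetics_700 = {"UCF-101": 92.0, "HMDB-51": 66.0, "ActivityNet": 75.9}
merged_km = {"UCF-101": 92.9, "HMDB-51": 69.4, "ActivityNet": 77.0}

# Gain of K+M pre-training over Kinetics-700 alone, rounded to one
# decimal place to absorb floating-point noise.
gains = {
    name: round(merged_km[name] - kinetics_700[name], 1)
    for name in kinetics_700
}
print(gains)  # {'UCF-101': 0.9, 'HMDB-51': 3.4, 'ActivityNet': 1.1}
```

The recomputed differences match the +0.9, +3.4, and +1.1 reported in the abstract.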

This review was generated by AI and reviewed by a human editor.