QUICK REVIEW

[Paper Review] Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification

Ali Diba, Mohsen Fayyaz|arXiv (Cornell University)|Nov 22, 2017

Human Pose and Action Recognition | References: 30 | Cited by: 187

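One-Sentence Summary

This paper proposes Temporal 3D ConvNets (T3D) and a Temporal Transition Layer (TTL) to capture temporal dynamics at multiple scales, extends DenseNet to DenseNet3D, and introduces a supervision transfer from 2D to 3D CNNs that yields a stable weight initialization and better performance when data is limited. It achieves state-of-the-art results on HMDB51 and UCF101 and is competitive on Kinetics.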

ABSTRACT

The work in this paper is driven by the question how to exploit the temporal cues available in videos for their accurate classification, and for human action recognition in particular? Thus far, the vision community has focused on spatio-temporal approaches with fixed temporal convolution kernel depths. We introduce a new temporal layer that models variable temporal convolution kernel depths. We embed this new temporal layer in our proposed 3D CNN. We extend the DenseNet architecture - which normally is 2D - with 3D filters and pooling kernels. We name our proposed video convolutional network 'Temporal 3D ConvNet' (T3D) and its new temporal layer 'Temporal Transition Layer' (TTL). Our experiments show that T3D outperforms the current state-of-the-art methods on the HMDB51, UCF101 and Kinetics datasets. The other issue in training 3D ConvNets is about training them from scratch with a huge labeled dataset to get a reasonable performance. So the knowledge learned in 2D ConvNets is completely ignored. Another contribution in this work is a simple and effective technique to transfer knowledge from a pre-trained 2D CNN to a randomly initialized 3D CNN for a stable weight initialization. This allows us to significantly reduce the number of training samples for 3D CNNs. Thus, by finetuning this network, we beat the performance of generic and recent methods in 3D CNNs, which were trained on large video datasets, e.g. Sports-1M, and finetuned on the target datasets, e.g. HMDB51/UCF101. The T3D codes will be released.

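Motivation and Objectives

Exploit the temporal cues in videos to improve action recognition.
Develop an architecture that can model variable temporal depths within a 3D CNN.
Extend DenseNet to DenseNet3D with a novel TTL to capture short-, mid-, and long-term temporal dynamics.
Introduce a cross-architecture transfer learning method that transfers knowledge from a pretrained 2D CNN to a randomly initialized 3D CNN to simplify training.
Evaluate on HMDB51, UCF101, and Kinetics to demonstrate performance and transferability.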

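Proposed Method

Introduce the Temporal Transition Layer (TTL), which concatenates features extracted at multiple temporal depths within a 3D convolutional framework.
Extend DenseNet to DenseNet3D by using 3D filters and pooling kernels in the densely connected blocks.
Integrate TTL into DenseNet3D to form Temporal 3D ConvNets (T3D), which learn short-, mid-, and long-term temporal dynamics.
Propose a supervision transfer from a pretrained 2D CNN (ImageNet) to a randomly initialized 3D CNN, implemented as an image-video correspondence task that matches frames with video clips.
Train T3D from scratch on Kinetics and fine-tune on the target datasets (UCF101, HMDB51); compare against other 3D CNNs that use RGB input only.
Show that the 2D→3D transfer strategy provides a stable weight initialization and improves data-efficient learning on small datasets.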
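
The TTL idea can be made concrete with a small, dependency-free Python sketch. This is a toy illustration, not the authors' implementation: it operates on a 1D sequence with one value per frame, and fixed averaging kernels stand in for learned 3D convolutions. The depths (1, 3, 5), the pooling size, and the function names temporal_conv_same and temporal_transition are illustrative assumptions; the pooling step is modeled on the transition layers of the original 2D DenseNet.

```python
# Toy sketch of multi-depth temporal feature extraction (TTL-style).
# Assumptions: 1D input, fixed averaging kernels, depths (1, 3, 5), pool size 2.

def temporal_conv_same(seq, kernel):
    """Slide `kernel` over `seq` in time, zero-padded so the length is kept."""
    left = (len(kernel) - 1) // 2
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - left + k
            if 0 <= idx < len(seq):  # positions outside the clip count as zero
                acc += w * seq[idx]
        out.append(acc)
    return out


def temporal_transition(seq, depths=(1, 3, 5), pool=2):
    """Apply one kernel per temporal depth, stack the branches, then pool."""
    # Every branch sees the same input through a different temporal window.
    branches = [temporal_conv_same(seq, [1.0 / d] * d) for d in depths]
    # Stacking the branches is the channel-wise concatenation; average
    # pooling over time then halves the temporal length of each channel.
    return [
        [sum(b[i:i + pool]) / pool for i in range(0, len(b) - pool + 1, pool)]
        for b in branches
    ]


# A single "event" at frame 2 of a 6-frame sequence.
features = temporal_transition([0.0, 0.0, 6.0, 0.0, 0.0, 0.0])
for depth, channel in zip((1, 3, 5), features):
    print(depth, channel)
```

Because every branch preserves the input length, the outputs can be stacked as channels; the deeper kernels spread the single event over a wider time span, which is the intuition behind capturing short-, mid-, and long-term dynamics in one layer.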

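Experimental Results

Research Questions

RQ1: Can a 3D CNN capture long-range temporal information without relying on a fixed kernel depth?
RQ2: Does a temporal transition layer with variable-depth kernels outperform fixed-depth 3D convolutions for action recognition?
RQ3: Can knowledge learned by a 2D CNN be transferred to a 3D CNN to reduce the need for large-scale labeled video datasets?
RQ4: How does T3D perform relative to state-of-the-art 3D ConvNets on HMDB51, UCF101, and Kinetics?
RQ5: Which input configurations (frame rate, resolution) best suit 3D video architectures?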

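Key Findings

T3D with TTL outperforms state-of-the-art 3D ConvNets on HMDB51 and UCF101 and is competitive on Kinetics.
A 2D pretrained CNN can act as a teacher that gives a randomly initialized 3D CNN a stable initialization, enabling effective transfer learning without a large-scale video dataset.
When trained from scratch on UCF101, T3D with TTL achieves higher accuracy than DenseNet3D and other 3D architectures.
Frame resolution and sampling rate significantly affect performance; 224x224 frames with a stride of 2 work better than smaller frames or larger strides.
Transfer learning (2D→3D) improves performance on UCF101 and HMDB51, matching or exceeding models that were trained on large-scale video datasets and fine-tuned on the target datasets.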
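
The 2D→3D transfer findings rest on an image-video correspondence task, and the data side of such a task can be sketched in a few lines of Python. This is a hedged illustration of how matching and non-matching frame/clip pairs might be built, not the authors' pipeline. The function name make_correspondence_pairs, the clip length of 16, the one-positive-one-negative sampling scheme, and the choice to draw negatives from a different video are all assumptions.

```python
import random

# Sketch of pair construction for an image-video correspondence task.
# Assumptions: every video has the same frame count; label 1 means the frame
# lies inside the clip, label 0 means the frame comes from another video.

def make_correspondence_pairs(video_ids, frames_per_video, clip_len=16, seed=0):
    """Return (frame, clip, label) triples, one positive and one negative per video."""
    if len(video_ids) < 2:
        raise ValueError("need at least two videos to draw negative pairs")
    if clip_len > frames_per_video:
        raise ValueError("clip_len must not exceed frames_per_video")
    rng = random.Random(seed)
    pairs = []
    for vid in video_ids:
        start = rng.randrange(frames_per_video - clip_len + 1)
        clip = (vid, start, start + clip_len)  # frames [start, end) of `vid`
        # Positive pair: a frame sampled from inside the clip's time span.
        pos_frame = (vid, rng.randrange(start, start + clip_len))
        pairs.append((pos_frame, clip, 1))
        # Negative pair: a frame taken from a different video.
        other = rng.choice([v for v in video_ids if v != vid])
        neg_frame = (other, rng.randrange(frames_per_video))
        pairs.append((neg_frame, clip, 0))
    return pairs


pairs = make_correspondence_pairs(["v0", "v1", "v2"], frames_per_video=64)
for frame, clip, label in pairs:
    print(frame, clip, label)
```

In a full setup the frame would go through the pretrained 2D network and the clip through the 3D network, with a binary classifier deciding whether the two match. One would typically keep the 2D teacher fixed and update only the 3D network; that training loop is omitted here.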

This review was generated by AI and reviewed by human editors.