QUICK REVIEW

[Paper Review] Towards Good Practices for Very Deep Two-Stream ConvNets

Limin Wang, Yuanjun Xiong|arXiv (Cornell University)|Jul 8, 2015

Human Pose and Action Recognition | References: 19 | Cited by: 385

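One-Sentence Summary

This paper proposes very deep two-stream ConvNets for video action recognition by adapting deep ImageNet architectures (GoogLeNet, VGGNet) to the video domain and using dedicated training strategies to mitigate over-fitting on small datasets. With pre-training, data augmentation, small learning rates, and high dropout ratios, it reaches 91.4% accuracy on UCF101, a new state-of-the-art result.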

ABSTRACT

Deep convolutional networks have achieved great success for object recognition in still images. However, for action recognition in videos, the improvement of deep convolutional networks is not so evident. We argue that there are two reasons that could probably explain this result. First the current network architectures (e.g. Two-stream ConvNets) are relatively shallow compared with those very deep models in image domain (e.g. VGGNet, GoogLeNet), and therefore their modeling capacity is constrained by their depth. Second, probably more importantly, the training dataset of action recognition is extremely small compared with the ImageNet dataset, and thus it will be easy to over-fit on the training dataset. To address these issues, this report presents very deep two-stream ConvNets for action recognition, by adapting recent very deep architectures into video domain. However, this extension is not easy as the size of action recognition is quite small. We design several good practices for the training of very deep two-stream ConvNets, namely (i) pre-training for both spatial and temporal nets, (ii) smaller learning rates, (iii) more data augmentation techniques, (iv) high drop out ratio. Meanwhile, we extend the Caffe toolbox into Multi-GPU implementation with high computational efficiency and low memory consumption. We verify the performance of very deep two-stream ConvNets on the dataset of UCF101 and it achieves the recognition accuracy of $91.4\%$.

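Research Motivation and Objectives

Address the limited performance of deep ConvNets in video action recognition, which the paper attributes to relatively shallow network architectures and small training datasets.
Overcome over-fitting on small video datasets by designing effective training strategies for very deep two-stream networks.
Extend the Caffe toolbox with an efficient, low-memory multi-GPU implementation to support large-scale deep learning on video tasks.
Achieve state-of-the-art performance on UCF101 by combining very deep architectures with robust training strategies.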

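Proposed Method

Adapt very deep ImageNet architectures (GoogLeNet and VGGNet) to the video domain by using them for both the spatial stream and the temporal stream.
Pre-train both the spatial and temporal networks on ImageNet to improve model initialization and generalization.
Use smaller learning rates and higher dropout ratios to reduce over-fitting when training on small video datasets.
Apply extensive data augmentation to increase the diversity of the effective training data and the robustness of the model.
Implement a multi-GPU version of Caffe with high computational efficiency and low memory consumption to support scalable training.
Fuse the predictions of the two streams by a weighted linear combination, with a temporal-to-spatial weight ratio of 2:1.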
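
The 2:1 weighted fusion of the two streams can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fuse_two_stream` and `predict` and the example scores are hypothetical, and the sketch assumes each stream outputs one score per class for the same ordered set of classes.

```python
def fuse_two_stream(spatial_scores, temporal_scores, w_spatial=1.0, w_temporal=2.0):
    """Weighted linear combination of per-class scores from the two streams.

    The default weights follow the 2:1 temporal-to-spatial ratio described
    above; everything else here is an illustrative assumption.
    """
    if len(spatial_scores) != len(temporal_scores):
        raise ValueError("both streams must score the same set of classes")
    return [
        w_spatial * s + w_temporal * t
        for s, t in zip(spatial_scores, temporal_scores)
    ]


def predict(scores):
    """Return the index of the highest-scoring class."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical per-class scores for one video over three classes.
spatial = [0.6, 0.3, 0.1]   # the spatial stream alone prefers class 0
temporal = [0.2, 0.7, 0.1]  # the temporal stream alone prefers class 1

fused = fuse_two_stream(spatial, temporal)
print(predict(spatial), predict(temporal), predict(fused))
```

In this made-up example the two streams disagree, and the larger temporal weight makes the fused prediction follow the temporal stream.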

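Experimental Results

Research Questions

RQ1: Can very deep two-stream ConvNets adapted from image classification models achieve better performance in action recognition?
RQ2: Which training strategies are needed to prevent over-fitting when training very deep networks on small video datasets such as UCF101?
RQ3: How do network depth and training methodology jointly affect recognition accuracy in video action recognition?
RQ4: How much do pre-training, data augmentation, and regularization improve performance on limited video datasets?
RQ5: Can the Caffe deep learning framework be effectively extended to support efficient multi-GPU training of very deep two-stream networks?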

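Key Findings

The proposed very deep two-stream ConvNets achieve a state-of-the-art recognition accuracy of 91.4% on UCF101.
VGGNet-16 outperforms the other architectures compared (such as ClarifaiNet and GoogLeNet) by about 5% on the spatial stream and about 4% on the temporal stream.
The very deep two-stream network improves accuracy by 3.4% over the original two-stream ConvNets, showing the benefit of increased depth.
It outperforms prior methods such as TDD+FV (90.3%) by 1.1%.
On the THUMOS15 dataset, deeper models fail to generalize without the proposed good practices, indicating that the training strategy is critical to success.
The multi-GPU Caffe implementation supports efficient training with low memory consumption, which enables large-scale deep learning on video tasks.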
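
As a quick consistency check on the figures above, and assuming the reported gains are absolute percentage points (the summary does not say so explicitly), the implied baseline and margin work out as follows:

```latex
\begin{align*}
\text{original two-stream accuracy (implied)} &= 91.4\% - 3.4\% = 88.0\% \\
\text{margin over TDD+FV} &= 91.4\% - 90.3\% = 1.1\%
\end{align*}
```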

This review was generated by AI and reviewed by human editors.