QUICK REVIEW

[Paper Review] Transfer Learning for Video Recognition with Scarce Training Data for Deep Convolutional Neural Network

Yu-Chuan Su, Tzu-Hsuan Chiu | arXiv (Cornell University) | Sep 15, 2014

Domain Adaptation and Few-Shot Learning | References: 40 | Cited by: 24

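One-Sentence Summary

This paper proposes transfer learning from weakly labeled image datasets to train deep convolutional networks (DCNs) for video recognition using only limited video training data. By initializing the DCNs from models pretrained on images and fine-tuning only the fully connected layers on 4,000 annotated videos, the method achieves strong performance while minimizing manual annotation effort, indicating that transfer learning enables effective video recognition even when data is scarce and supervision is weak.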

ABSTRACT

Unconstrained video recognition and Deep Convolution Network (DCN) are two active topics in computer vision recently. In this work, we apply DCNs as frame-based recognizers for video recognition. Our preliminary studies, however, show that video corpora with complete ground truth are usually not large and diverse enough to learn a robust model. The networks trained directly on the video data set suffer from significant overfitting and have poor recognition rate on the test set. The same lack-of-training-sample problem limits the usage of deep models on a wide range of computer vision problems where obtaining training data are difficult. To overcome the problem, we perform transfer learning from images to videos to utilize the knowledge in the weakly labeled image corpus for video recognition. The image corpus help to learn important visual patterns for natural images, while these patterns are ignored by models trained only on the video corpus. Therefore, the resultant networks have better generalizability and better recognition rate. We show that by means of transfer learning from image to video, we can learn a frame-based recognizer with only 4k videos. Because the image corpus is weakly labeled, the entire learning process requires only 4k annotated instances, which is far less than the million scale image data sets required by previous works. The same approach may be applied to other visual recognition tasks where only scarce training data is available, and it improves the applicability of DCNs in various computer vision problems. Our experiments also reveal the correlation between meta-parameters and the performance of DCNs, given the properties of the target problem and data. These results lead to a heuristic for meta-parameter selection for future researches, which does not rely on the time consuming meta-parameter search.

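Motivation and Objectives

Address the shortage of labeled video data for training deep convolutional networks (DCNs), which causes severe overfitting.
Overcome the high cost of frame-level or pixel-level video annotation by leveraging weakly labeled image corpora.
Achieve effective video recognition with minimal manually annotated video data by transferring visual patterns learned from large-scale image datasets.
Study how meta-parameters such as network depth and input resolution affect DCN performance in low-data settings.
Demonstrate that transfer learning from images to videos improves generalization and recognition accuracy even though the image and video domains differ.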

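Proposed Method

Pretrain the DCN on a large-scale image corpus (e.g., the weakly labeled Yahoo!-Flickr corpus or ILSVRC2012) to learn general visual features.
Initialize the video recognition network with the pretrained image model's weights, transferring the learned convolutional filters to the video task.
Fine-tune only the DCN's fully connected layers on the small video dataset (4,000 videos) while keeping the convolutional layers frozen to prevent overfitting.
Combine transfer learning from multiple image sources (e.g., Yahoo!-Flickr and ILSVRC2012) to further improve video recognition performance.
Use frame-level inputs extracted from video clips as the DCN input, treating each frame as an image to be recognized.
Evaluate the effect of network depth and input resolution on performance through ablation studies on the CCV video dataset.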
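The core of the method above (a frozen, image-pretrained feature extractor, a trainable classifier head, and frame-based recognition) can be illustrated with a toy sketch. It is not the authors' implementation: `frozen_features` is a hypothetical stand-in for the pretrained convolutional layers, the "videos" are synthetic two-dimensional points, the head is a single softmax layer rather than a stack of fully connected layers, and averaging per-frame probabilities into a video-level prediction is an assumed aggregation rule that this summary does not specify.

```python
import math
import random


def frozen_features(frame):
    """Hypothetical stand-in for the frozen, image-pretrained conv layers.

    It is a fixed transform with no trainable parameters; a constant 1.0
    is appended so the head can learn a bias.
    """
    return [math.tanh(x) for x in frame] + [1.0]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head_probs(weights, feat):
    """Class probabilities from the trainable linear head."""
    return softmax([sum(w * f for w, f in zip(row, feat)) for row in weights])


def train_head(videos, labels, num_classes, epochs=50, lr=0.5):
    """Train only the head with SGD on cross-entropy.

    Every frame inherits the label of its video, so only video-level
    annotations are needed. The feature extractor is never updated.
    """
    dim = len(frozen_features(videos[0][0]))
    weights = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(epochs):
        for frames, label in zip(videos, labels):
            for frame in frames:
                feat = frozen_features(frame)
                probs = head_probs(weights, feat)
                for c in range(num_classes):
                    grad = probs[c] - (1.0 if c == label else 0.0)
                    for j in range(dim):
                        weights[c][j] -= lr * grad * feat[j]
    return weights


def predict_video(weights, frames):
    """Average per-frame probabilities (assumed rule) and take the argmax."""
    num_classes = len(weights)
    avg = [0.0] * num_classes
    for frame in frames:
        probs = head_probs(weights, frozen_features(frame))
        for c in range(num_classes):
            avg[c] += probs[c] / len(frames)
    return max(range(num_classes), key=lambda c: avg[c])


def make_toy_video(label, rng, num_frames=5):
    """Synthetic 'video': frames are noisy 2-D points around a class center."""
    center = -1.0 if label == 0 else 1.0
    return [[center + rng.gauss(0, 0.3), rng.gauss(0, 0.3)]
            for _ in range(num_frames)]


rng = random.Random(0)
train_labels = [0, 1] * 10
train_videos = [make_toy_video(y, rng) for y in train_labels]
weights = train_head(train_videos, train_labels, num_classes=2)

test_labels = [0, 1, 1, 0]
test_videos = [make_toy_video(y, rng) for y in test_labels]
predictions = [predict_video(weights, v) for v in test_videos]
```

Because `frozen_features` has no trainable parameters, the only values fitted to the small labeled set are the head weights, which is the mechanism the method relies on to limit overfitting. In a real setting the stand-in would be replaced by the convolutional layers of an image-pretrained network.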

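Experimental Results

Research Questions

RQ1: When only a small number of video samples is available, does transfer learning from weakly labeled image datasets improve video recognition performance?
RQ2: In low-data settings, does fine-tuning only the fully connected layers while keeping the convolutional layers frozen generalize better than end-to-end fine-tuning?
RQ3: How does the choice of pretraining dataset (e.g., Yahoo!-Flickr versus ILSVRC2012) affect the performance of the final video recognizer?
RQ4: How do network depth and input resolution affect the video recognition performance of DCNs trained with limited data?
RQ5: Despite the domain gap, can transfer learning from weakly supervised images still achieve strong performance on video tasks?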

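Key Findings

Fine-tuning only the fully connected layers while keeping the convolutional layers frozen significantly reduces overfitting and improves recognition accuracy, especially when training data is limited.
The method achieves strong performance with only 4,000 annotated videos, showing that transfer learning makes it feasible to train DCNs when data is scarce.
Pretraining on weakly labeled image datasets (e.g., Yahoo!-Flickr) improves video recognition performance without requiring expensive manual annotation of video data.
Compared with Yahoo!-Flickr, ILSVRC2012 provides a stronger supervision signal because its labels are more precise, and it performs better especially with deeper networks.
Combining pretraining from multiple image sources (e.g., Yahoo!-Flickr and ILSVRC2012) further improves recognition accuracy on the video dataset.
Higher-resolution input consistently yields better performance, but the benefit is more pronounced for object-level recognition than for scene-level recognition.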

This review was generated by AI and reviewed by human editors.