QUICK REVIEW

[Paper Review] Transfer Learning for Video Recognition with Scarce Training Data

Yu-Chuan Su, Tzu-Hsuan Chiu|arXiv (Cornell University)|Sep 15, 2014

Human Pose and Action Recognition | References: 25 | Cited by: 4

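One-Sentence Summary

This paper proposes transfer learning from a weakly labeled image corpus to video recognition, so that a robust frame-based video classifier can be trained with only about 4,000 annotated videos. Reusing features learned from images reduces overfitting and improves recognition rates without large-scale video annotation, which substantially lowers the data requirements of deep learning for video recognition.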

ABSTRACT

Unconstrained video recognition and Deep Convolution Network (DCN) are two active topics in computer vision recently. In this work, we apply DCNs as frame-based recognizers for video recognition. Our preliminary studies, however, show that video corpora with complete ground truth are usually not large and diverse enough to learn a robust model. The networks trained directly on the video data set suffer from significant overfitting and have poor recognition rate on the test set. The same lack-of-training-sample problem limits the usage of deep models on a wide range of computer vision problems where obtaining training data are difficult. To overcome the problem, we perform transfer learning from images to videos to utilize the knowledge in the weakly labeled image corpus for video recognition. The image corpus help to learn important visual patterns for natural images, while these patterns are ignored by models trained only on the video corpus. Therefore, the resultant networks have better generalizability and better recognition rate. We show that by means of transfer learning from image to video, we can learn a frame-based recognizer with only 4k videos. Because the image corpus is weakly labeled, the entire learning process requires only 4k annotated instances, which is far less than the million scale image data sets required by previous works. The same approach may be applied to other visual recognition tasks where only scarce training data is available, and it improves the applicability of DCNs in various computer vision problems. Our experiments also reveal the correlation between meta-parameters and the performance of DCNs, given the properties of the target problem and data. These results lead to a heuristic for meta-parameter selection for future researches, which does not rely on the time consuming meta-parameter search.

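Motivation and Objectives

Address the overfitting of deep video recognition models caused by video datasets that are small and lack diversity.
Overcome data scarcity in video recognition by transferring knowledge from a large weakly labeled image dataset.
Develop a transfer learning framework that improves generalization and test performance using very few video annotations.
Make deep convolutional networks (DCNs) usable for video recognition tasks where large-scale video annotation is hard to obtain.
Propose a heuristic for meta-parameter selection that reduces reliance on time-consuming meta-parameter search.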

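Proposed Method

Fine-tune a deep convolutional network (DCN), pretrained on a large weakly labeled image corpus, for video frame classification.
Use the pretrained image features as a strong initialization for video recognition, capturing general visual patterns of natural images.
Train the network end to end on a small video dataset of only about 4,000 annotated videos, without additional data augmentation or strong supervision.
Use the weakly labeled nature of the image corpus to avoid the need for a large annotated video dataset.
Apply transfer learning to carry visual knowledge from images over to videos, improving feature representations in the low-data regime.
Derive, through empirical analysis, a heuristic for meta-parameter selection based on properties of the data and the problem, avoiding a time-consuming exhaustive search.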
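
The pretrain-then-fine-tune recipe in this list can be illustrated with a deliberately tiny stand-in. The sketch below is not the authors' network or data: a three-weight logistic model replaces the DCN, the "image" and "video" sets are synthetic, and every name and value (`make_data`, `train`, the learning rates, the sample counts) is a hypothetical choice made for illustration. It only shows the mechanics: train on the large source task, then continue training from those weights on the small target task with a smaller learning rate.

```python
import math
import random


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to keep math.exp from overflowing
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def train(data, init_w, lr, weight_decay=0.0, epochs=20):
    """SGD on the logistic loss, starting from a copy of init_w."""
    w = list(init_w)
    for _ in range(epochs):
        for x, y in data:
            err = predict(w, x) - y
            w = [wi - lr * (err * xi + weight_decay * wi) for wi, xi in zip(w, x)]
    return w


def accuracy(data, w):
    hits = sum((predict(w, x) >= 0.5) == (y == 1) for x, y in data)
    return hits / len(data)


def make_data(rng, true_w, n):
    """Toy samples labelled by the sign of a linear score (assumed setup)."""
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in true_w]
        y = 1 if sum(t * xi for t, xi in zip(true_w, x)) > 0 else 0
        data.append((x, y))
    return data


rng = random.Random(0)
# Hypothetical stand-ins: a large "image" task and a related, tiny "video" task.
image_data = make_data(rng, [1.0, -1.0, 0.5], 500)
video_train = make_data(rng, [1.0, -0.8, 0.6], 10)
video_test = make_data(rng, [1.0, -0.8, 0.6], 200)

zeros = [0.0, 0.0, 0.0]
pretrained = train(image_data, zeros, lr=0.1)        # step 1: learn from images
scratch = train(video_train, zeros, lr=0.1)          # baseline: videos only
finetuned = train(video_train, pretrained, lr=0.01)  # step 2: adapt to videos

print("scratch  :", accuracy(video_test, scratch))
print("finetuned:", accuracy(video_test, finetuned))
```

The only difference between the `scratch` and `finetuned` runs is the starting point and the step size, which is the essence of using pretrained weights as an initialization. How the two accuracies compare on this toy data says nothing about the paper's results.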

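Experimental Results

Research Questions

RQ1: When only a small number of video samples is available, does transfer learning from an image dataset significantly improve video recognition performance?
RQ2: Compared with training from scratch, how does pretraining on weakly labeled image data improve the generalization of DCNs in video recognition?
RQ3: How do different meta-parameters (such as the initial learning rate and weight decay) affect DCN performance in low-data video recognition settings?
RQ4: Can a heuristic for meta-parameter selection be derived from empirical results, without relying on a computationally expensive meta-parameter search?
RQ5: When video data is limited and insufficiently diverse, to what extent do visual patterns learned from images improve recognition accuracy on video data?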
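
To make concrete why RQ3 and RQ4 care about search cost, here is a minimal sketch of the exhaustive meta-parameter search that a selection heuristic is meant to replace. Everything in it is assumed for illustration: the candidate grids, the `grid_search` helper, and `toy_validation_score`, which stands in for "train a DCN with these settings and return its validation accuracy". The point is only that the number of full training runs is the product of the grid sizes.

```python
import itertools
import math


def grid_search(evaluate, learning_rates, weight_decays):
    """Try every (lr, wd) pair; return the best pair and the number of runs."""
    best, best_score, runs = None, -math.inf, 0
    for lr, wd in itertools.product(learning_rates, weight_decays):
        score = evaluate(lr, wd)  # in practice: one full training run
        runs += 1
        if score > best_score:
            best, best_score = (lr, wd), score
    return best, runs


def toy_validation_score(lr, wd):
    # Hypothetical stand-in for a real training run; peaks at lr=1e-2, wd=1e-4.
    return -((math.log10(lr) + 2) ** 2) - (math.log10(wd) + 4) ** 2


best, runs = grid_search(
    toy_validation_score,
    learning_rates=[1e-1, 1e-2, 1e-3, 1e-4],
    weight_decays=[1e-3, 1e-4, 1e-5],
)
print(best, runs)
```

With four learning rates and three weight decays the search already needs twelve training runs, and each added meta-parameter multiplies that count again.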

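Key Findings

Transfer learning from an image dataset makes it possible to train a robust frame-based video recognizer with only about 4,000 annotated videos, substantially reducing data requirements.
Because learned visual patterns are transferred, the model generalizes better and reaches a higher test recognition rate than a model trained from scratch on the same small video dataset.
Using weakly labeled image data removes the need for a large, fully annotated video dataset, which makes the approach practical in low-resource domains.
The entire learning process requires only about 4,000 annotated instances, far fewer than the million-scale annotated image datasets required by previous works.
A heuristic for meta-parameter selection is derived from the empirical results, reducing the need for time-consuming meta-parameter search in future video recognition work.
The experiments reveal a correlation between meta-parameters and model performance, given the properties of the target problem and data, which can guide configuration choices for similar low-data problems.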
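
A frame-based recognizer scores individual frames, so a video-level label needs some pooling step. This summary does not state which pooling the authors used. The sketch below assumes simple averaging of per-frame class probabilities, a common choice, and the function names and scores are hypothetical.

```python
def video_scores(frame_scores):
    """Average per-frame class probabilities into one score vector per video."""
    if not frame_scores:
        raise ValueError("need at least one frame")
    n_classes = len(frame_scores[0])
    n_frames = len(frame_scores)
    return [sum(f[c] for f in frame_scores) / n_frames for c in range(n_classes)]


def video_label(frame_scores):
    """Index of the class with the highest averaged score."""
    scores = video_scores(frame_scores)
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up per-frame probabilities for a 4-frame video and 3 classes.
frames = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
    [0.7, 0.2, 0.1],
]
print(video_scores(frames))
print(video_label(frames))
```

Averaging lets a single misclassified frame (the second one here) be outvoted by the rest of the video. Other pooling rules, such as taking the per-class maximum, would slot into `video_scores` the same way.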


This review was generated by AI and reviewed by human editors.