QUICK REVIEW

[Paper Review] HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Antoine Miech, Dimitri Zhukov|arXiv (Cornell University)|Jun 7, 2019

Multimodal Machine Learning Applications | References: 68 | Cited by: 878

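One-Sentence Summary

This paper presents HowTo100M, a large-scale video-language dataset of 136M narrated clips, together with a joint text-video embedding trained with a max-margin ranking loss. The embedding achieves state-of-the-art results on instructional datasets and transfers well to generic video domains.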

ABSTRACT

Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic Youtube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: www.di.ens.fr/willow/research/howto100m/.

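Motivation and Objectives

Learn a robust text-video embedding from the automatically transcribed narrations of instructional videos, without manually written captions.
Build a scalable, weakly supervised dataset (HowTo100M) for training a joint video-text representation.
Show that the resulting embedding delivers strong text-to-video retrieval and action localization on instructional datasets and transfers to non-instructional domains.
Show that data scale and the sampling strategy have a critical effect on performance.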

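Proposed Method

Learn a joint embedding that maps video and caption features into a shared 4,096-dimensional space through a non-linear gated projection (inspired by prior work).
Optimize with a max-margin ranking loss that pulls matching video-caption pairs together and pushes negatives apart, drawing part of the negatives from the same video (intra-video negative sampling) so the model focuses on relevant content.
Represent each video clip with temporally max-pooled 2D and 3D CNN features, and each caption with a shallow text CNN over word embeddings.
Train the embedding on HowTo100M with Adam, using a fixed margin to separate correct pairs from incorrect ones.
Study how the negative sampling strategy and the amount of training data affect downstream tasks.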
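
The method steps above can be made concrete with a small sketch. The following pure-Python code is illustrative only and is not the authors' implementation: the exact form of the gated unit, the margin value of 0.1, the toy dimensions, and every function name (`temporal_max_pool`, `gated_embedding`, `init_params`, `similarity_matrix`, `max_margin_ranking_loss`) are assumptions made for the example. It pools frame features over time, projects both modalities with a sigmoid-gated unit, and scores a batch with a bidirectional max-margin ranking loss in which every non-matching pair in the batch serves as a negative.

```python
import math
import random


def temporal_max_pool(frame_features):
    """Element-wise max over time: T frame vectors -> one clip vector."""
    return [max(column) for column in zip(*frame_features)]


def linear(weights, bias, x):
    """Affine map W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def gated_embedding(x, params):
    """Assumed gated unit: h = W1 x + b1, out = normalize(h * sigmoid(W2 h + b2))."""
    w1, b1, w2, b2 = params
    h = linear(w1, b1, x)
    gate = [1.0 / (1.0 + math.exp(-z)) for z in linear(w2, b2, h)]
    out = [hi * gi for hi, gi in zip(h, gate)]
    norm = math.sqrt(sum(v * v for v in out))
    return [v / (norm + 1e-8) for v in out]


def init_params(rng, in_dim, out_dim):
    """Hypothetical random initializer for one branch (video or text)."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                for _ in range(rows)]
    return (matrix(out_dim, in_dim), [0.0] * out_dim,
            matrix(out_dim, out_dim), [0.0] * out_dim)


def similarity_matrix(video_embs, text_embs):
    """sim[i][j] = dot product between video i and caption j."""
    return [[sum(a * b for a, b in zip(v, t)) for t in text_embs]
            for v in video_embs]


def max_margin_ranking_loss(sim, margin=0.1):
    """Bidirectional hinge loss; sim[i][i] is the matching pair for item i."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Caption j is a negative for video i.
            total += max(0.0, margin + sim[i][j] - sim[i][i])
            # Video j is a negative for caption i.
            total += max(0.0, margin + sim[j][i] - sim[i][i])
    return total / n


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy batch: 3 clips with 4 frames of 6-d features, 3 captions with 5-d features.
    clips = [[[rng.random() for _ in range(6)] for _ in range(4)] for _ in range(3)]
    captions = [[rng.random() for _ in range(5)] for _ in range(3)]
    video_params = init_params(rng, in_dim=6, out_dim=4)
    text_params = init_params(rng, in_dim=5, out_dim=4)
    video_embs = [gated_embedding(temporal_max_pool(c), video_params) for c in clips]
    text_embs = [gated_embedding(c, text_params) for c in captions]
    print("loss:", max_margin_ranking_loss(similarity_matrix(video_embs, text_embs)))
```

In this sketch, intra-video negative sampling would correspond to how the batch is assembled: if some of the clip-caption pairs in a batch come from the same source video, their off-diagonal entries in `sim` act as the harder, same-video negatives. The caption vectors here stand in for the output of the shallow text CNN, which is not modeled.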

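Experimental Results

Research Questions

RQ1: Can a large-scale, automatically paired text-video dataset support learning a strong joint embedding without manually annotated captions?
RQ2: How does HowTo100M pre-training affect text-to-video retrieval and action localization on instructional datasets, and how well does it transfer across domains to generic YouTube videos (MSR-VTT) and movies (LSMDC)?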

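Key Findings

The joint text-video embedding trained on HowTo100M achieves state-of-the-art action localization and text-to-video retrieval on instructional datasets (CrossTask, YouCook2).
The HowTo100M pre-trained embedding transfers positively when fine-tuned on non-instructional domains (MSR-VTT, LSMDC) and outperforms models trained on those datasets alone.
Intra-video negative sampling clearly improves retrieval and localization, especially on fine-grained instructional datasets.
Scale matters: increasing the amount of HowTo100M training data yields consistent gains with no saturation observed, which suggests that more data could improve results further.
Fine-tuning the pre-trained model on the target dataset yields significant gains and, on some tasks, even surpasses fully supervised baselines.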

This review was generated by AI and reviewed by a human editor.