QUICK REVIEW

[Paper Review] End-to-End Learning of Visual Representations from Uncurated Instructional Videos

Antoine Miech, Jean-Baptiste Alayrac|arXiv (Cornell University)|Dec 13, 2019

Human Pose and Action Recognition|References: 90|Cited by: 39

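One-Sentence Summary

The paper proposes MIL-NCE, a Noise Contrastive Estimation objective based on multiple instance learning (MIL), which learns joint video-text representations from uncurated narrated instructional videos without manual annotation and performs well on multiple downstream tasks.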

ABSTRACT

Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.

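Motivation and Objectives

Motivate learning robust visual representations from uncurated narrated videos without any manual annotation.
Propose MIL-NCE to handle the misalignment between video content and narration.
Learn a joint video-text embedding starting from raw pixels and ASR-transcribed narrations.
Show that the learned representations transfer well to a variety of downstream video understanding tasks.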

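Proposed Method

Define a simple joint embedding model in which f maps a video clip to an embedding and g maps a narration into the same embedding space.
Introduce the MIL-NCE loss, which sums over a set of candidate positive pairs for each training sample and contrasts them against negatives, so that learning remains possible under misalignment.
Construct the candidate positive set by taking narrations that are temporally close to the clip as potential ground-truth descriptions.
Train with a discriminative, softmax-based NCE objective that uses negatives from the current batch, with the MIL extension placing the sum over positive candidates in the numerator.
Compare symmetric and asymmetric negative sampling and show that performance is best when negatives are sampled for both videos and narrations.
Evaluate with a 3D CNN backbone (I3D/S3D) and a text model that together form the joint embedding, trained on HowTo100M without human labels.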
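
The loss described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `mil_nce_loss` and `logsumexp`, the array shapes, and the choice to count each positive pair once in the denominator are all assumptions made here for clarity. Each clip i comes with K candidate narrations; the numerator sums exp(f(x_i)·g(y_ik)) over those K candidates, and the denominator adds in-batch negatives sampled symmetrically, that is, clip i against other clips' narrations and other clips against clip i's narrations.

```python
import numpy as np


def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def mil_nce_loss(video_emb, text_emb):
    """Illustrative MIL-NCE loss (hypothetical helper, not from the paper's code).

    video_emb: (B, D) array, one embedding f(x_i) per clip.
    text_emb:  (B, K, D) array, K candidate narration embeddings g(y_ik) per clip.
    Returns the mean over the batch of -log(positives / (positives + negatives)).
    """
    video_emb = np.asarray(video_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    B, K, _ = text_emb.shape

    # sim[i, j, k] = f(x_i) . g(y_jk): clip i scored against candidate k of clip j.
    sim = np.einsum("id,jkd->ijk", video_emb, text_emb)
    idx = np.arange(B)

    # Numerator: log-sum-exp over the K candidate positives of each clip.
    pos_lse = logsumexp(sim[idx, idx, :], axis=1)

    # Denominator, part 1: clip i against every narration in the batch
    # (this includes clip i's own positives once).
    rows = sim.reshape(B, B * K)

    # Denominator, part 2: every other clip against clip i's narrations.
    # The diagonal is masked so positives are not counted a second time.
    cols = np.transpose(sim, (1, 0, 2)).copy()
    cols[idx, idx, :] = -np.inf
    cols = cols.reshape(B, B * K)

    den_lse = logsumexp(np.concatenate([rows, cols], axis=1), axis=1)
    return float(np.mean(den_lse - pos_lse))


# Example call with random embeddings: 4 clips, 3 candidate narrations each.
rng = np.random.default_rng(0)
video = rng.normal(size=(4, 16))
text = rng.normal(size=(4, 3, 16))
print(mil_nce_loss(video, text))
```

Because the numerator sums over all K candidates, the loss can be driven down by any one well-matched narration in the set, which is what lets training tolerate narrations that are shifted in time or unrelated to the clip.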

Experimental Results

Research Questions

RQ1: Can MIL-NCE learn useful joint video-text representations from uncurated narrated videos without manual annotations?
RQ2: Does incorporating multiple positive candidates and symmetric negative sampling improve learning under misalignment between video and narration?
RQ3: How well do the learned representations perform on a range of downstream tasks (action recognition, retrieval, localization, segmentation) compared to self-supervised and supervised baselines?
RQ4: Is a simple language model sufficient for effective text-video matching in this setting?

Key Findings

MIL-NCE learns strong video representations from scratch using uncurated instructional videos.
The method outperforms all published self-supervised approaches and several fully supervised baselines across four downstream tasks and eight datasets.
Using multiple positive narration candidates improves performance over single-instance learning, with best results when using 3–5 positives.
Symmetric sampling of negatives (video and narration) yields better results than asymmetric alternatives.
Joint video-text representations achieve strong text-to-video retrieval performance and state-of-the-art results on some datasets without target-dataset training.
Visual representations trained on HowTo100M generalize well to diverse action recognition and localization benchmarks.


This review was generated by AI and reviewed by human editors.