QUICK REVIEW

[Paper Review] Watching the World Go By: Representation Learning from Unlabeled Videos

Daniel Gordon, Kiana Ehsani|arXiv (Cornell University)|Mar 18, 2020

Human Pose and Action Recognition | References: 41 | Cited by: 38

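One-Sentence Summary

VINCE learns image representations from unlabeled videos through multi-frame, multi-pair noise contrastive learning, and outperforms MoCo and supervised ImageNet pretraining on several temporal and non-temporal tasks.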

ABSTRACT

Recent single image unsupervised representation learning techniques show remarkable success on a variety of tasks. The basic principle in these works is instance discrimination: learning to differentiate between two augmented versions of the same image and a large batch of unrelated images. Networks learn to ignore the augmentation noise and extract semantically meaningful representations. Prior work uses artificial data augmentation techniques such as cropping, and color jitter which can only affect the image in superficial ways and are not aligned with how objects actually change e.g. occlusion, deformation, viewpoint change. In this paper, we argue that videos offer this natural augmentation for free. Videos can provide entirely new views of objects, show deformation, and even connect semantically similar but visually distinct concepts. We propose Video Noise Contrastive Estimation, a method for using unlabeled video to learn strong, transferable single image representations. We demonstrate improvements over recent unsupervised single image techniques, as well as over fully supervised ImageNet pretraining, across a variety of temporal and non-temporal tasks. Code and the Random Related Video Views dataset are available at https://www.github.com/danielgordon10/vince

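Research Motivation and Goals

Move representation learning beyond single-image augmentation by exploiting the natural variation in video (occlusion, deformation, viewpoint change).
Propose a self-supervised framework that uses unlabeled video to learn transferable image representations.
Show that video-based contrastive learning can outperform recent unsupervised single-image methods and supervised ImageNet pretraining across diverse tasks.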
Demonstrate the effectiveness of Random Related Video Views (R2V2) as a scalable unlabeled video dataset for pretraining.
Evaluate the learned representations on a range of tasks including image classification, scene classification, action recognition, and object tracking.

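Proposed Method

Propose Video Noise Contrastive Estimation (VINCE), which learns whether two images come from the same video rather than from the same image.
Use multi-frame positives: sample several frames from the same video to form anchor-positive pairs.
Extend Noise Contrastive Estimation with a memory bank and momentum (as in MoCo) to support a large number of negatives and stable training.
Apply Multi-Pair NCE, which groups positives from multiple frames and videos and uses a block-diagonal masking strategy (Algorithm 1) to increase the number of positive pairs per batch.
Build Random Related Video Views (R2V2): ~960k frames from ~240k uncurated videos, obtained by sampling four frames per video at ~5 s intervals from YouTube CC videos retrieved with ImageNet class queries for semantic diversity.
Evaluate VINCE on downstream tasks by freezing the representation and training a lightweight classifier per task (a linear classifier, or an LSTM followed by a linear layer).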
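
The multi-pair idea in the list above can be illustrated with a small sketch. This is not the paper's Algorithm 1: it is one plausible reading of "block-diagonal masking", in which every pair of distinct frames from the same video counts as a positive, frames from other videos are the negatives, and other same-video frames are masked out of each pair's denominator. The function name `multi_pair_nce_loss`, the `video_ids` argument, and the temperature value are assumptions made for illustration; the memory bank and momentum encoder are omitted.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def multi_pair_nce_loss(embeddings, video_ids, temperature=0.07):
    """Average InfoNCE-style loss over every (anchor, positive) pair in a batch.

    embeddings: list of feature vectors, one per frame (hypothetical input).
    video_ids:  list of ints; frames sharing an id come from the same video.
    """
    feats = [l2_normalize(e) for e in embeddings]
    n = len(feats)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[dot(feats[i], feats[j]) / temperature for j in range(n)]
              for i in range(n)]
    # Block-diagonal mask (when frames are grouped by video): True for same video.
    same_video = [[video_ids[i] == video_ids[j] for j in range(n)]
                  for i in range(n)]

    losses = []
    for i in range(n):
        negatives = [logits[i][j] for j in range(n) if not same_video[i][j]]
        for j in range(n):
            if i == j or not same_video[i][j]:
                continue
            # One positive against all negatives; other positives are masked out.
            candidates = [logits[i][j]] + negatives
            peak = max(candidates)  # subtract the max for numerical stability
            log_denominator = peak + math.log(
                sum(math.exp(c - peak) for c in candidates))
            losses.append(log_denominator - logits[i][j])
    if not losses:
        raise ValueError("batch contains no same-video frame pairs")
    return sum(losses) / len(losses)
```

With two frames per video, each frame contributes one positive pair; with four frames per video, as in R2V2, each frame contributes three, which is how grouping raises the number of positives per batch without enlarging it. The loss is near zero when same-video frames have matching embeddings and grows when positives look less similar than negatives.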

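Experimental Results

Research Questions

RQ1: Can unlabeled video provide a supervisory signal that yields more transferable image representations than single-image augmentation alone?
RQ2: Does multi-frame, multi-pair contrastive learning improve the semantic consistency and temporal understanding of the learned representations?
RQ3: How does VINCE compare with MoCo-based methods and supervised ImageNet pretraining on image, scene, action, and tracking tasks?
RQ4: How does the pretraining data source (R2V2 vs. YouTube8M vs. Kinetics) affect downstream performance?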

Main Findings

Training setup / Test task	ImageNet	SUN Scene	Kinetics 400	OTB 2015 Precision	OTB 2015 Success
Same frame	0.358	0.450	0.318	0.555	0.403
Multi-frame	0.381	0.478	0.361	0.622	0.464
Multi-frame, multi-pair	0.400	0.495	0.362	0.629	0.465
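
To make the ablation easier to read, the gains in the table above can be tallied with a short script. The scores are copied from the table; the row keys are English renderings of the row labels, and the helper name `gains` is hypothetical.

```python
# Scores copied from the ablation table (rows: training setup, columns: task).
TASKS = ["ImageNet", "SUN Scene", "Kinetics 400",
         "OTB 2015 Precision", "OTB 2015 Success"]
SCORES = {
    "same frame": [0.358, 0.450, 0.318, 0.555, 0.403],
    "multi-frame": [0.381, 0.478, 0.361, 0.622, 0.464],
    "multi-frame multi-pair": [0.400, 0.495, 0.362, 0.629, 0.465],
}


def gains(baseline, variant):
    """Return {task: (absolute gain, relative gain in percent)} of variant over baseline."""
    result = {}
    for task, base, new in zip(TASKS, SCORES[baseline], SCORES[variant]):
        delta = new - base
        result[task] = (round(delta, 3), round(100.0 * delta / base, 1))
    return result


if __name__ == "__main__":
    for task, (absolute, relative) in gains(
            "same frame", "multi-frame multi-pair").items():
        print(f"{task}: +{absolute:.3f} ({relative:.1f}% relative)")
```

For example, moving from same-frame to multi-frame, multi-pair training raises the ImageNet score by 0.042 absolute (about 11.7% relative) and OTB 2015 Precision by 0.074 (about 13.3% relative).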

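VINCE improves over the MoCo baseline and over supervised ImageNet pretraining on several tasks.
On ImageNet and SUN Scene, VINCE outperforms MoCo-R2V2, indicating better generalization to scene-level semantics.
On Kinetics 400 (action recognition), VINCE shows strong temporal performance, surpassing the temporal baselines.
VINCE yields robust gains on object tracking (OTB 2015), with the largest improvement in the multi-frame, multi-pair setting.
Ablations show that multi-frame inputs and multi-pair NCE substantially improve over standard same-frame NCE, with larger gains on the more semantic tasks.
The pretraining data source matters: R2V2 (built from ImageNet queries) performs best on ImageNet, YouTube8M URLs give broader gains on tracking, and Kinetics URLs give strong performance on Kinetics.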

This review was generated by AI and reviewed by a human editor.