QUICK REVIEW

[Paper Review] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Tong Zhan, Yibing Song|arXiv (Cornell University)|Mar 23, 2022

Generative Adversarial Networks and Image Synthesis|Cited by 431

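One-Sentence Summary

VideoMAE shows that masked autoencoders with tube masking enable data-efficient self-supervised pre-training of video transformers, achieving strong results even on small datasets and without any extra data.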

ABSTRACT

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging extracting more effective video representations during this pre-training process. We obtain three important findings on SSVP: (1) An extremely high proportion of masking ratio (i.e., 90% to 95%) still yields favorable performance of VideoMAE. The temporally redundant video content enables a higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets is an important issue. Notably, our VideoMAE with the vanilla ViT can achieve 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51, without using any extra data. Code is available at https://github.com/MCG-NJU/VideoMAE.

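Motivation and Objectives

Show that masked autoencoding with a vanilla ViT backbone is effective for self-supervised video pre-training (SSVP).
Design a masking strategy (tube masking) and a reconstruction task suited to video data, so as to avoid information leakage and encourage high-level spatiotemporal learning.
Demonstrate that VideoMAE can be trained on relatively small video datasets without external data, and compare it with contrastive and other self-supervised baselines.
Analyze how the masking ratio, the quality and quantity of pre-training data, and domain shift affect downstream transfer performance.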

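Proposed Method

Adopts the masked autoencoder paradigm of ImageMAE but customizes it for video, using 3D cube (spatiotemporal) tokenization and a very high masking ratio (90% to 95%).
Uses temporal downsampling and cube embedding to reduce the spatiotemporal dimensionality of the input.
Implements tube masking, in which the same masking map is shared across frames, to reduce information leakage caused by temporal correlation.
Adopts an asymmetric encoder-decoder architecture, with a deeper decoder to improve reconstruction of the masked video tokens.
Trains with a joint space-time ViT backbone and reconstructs the pixel values of the masked tokens under an MSE loss.
Runs extensive ablations on the masking strategy, reconstruction target, pre-training data, and backbone on SSVP benchmarks.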
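
The tube masking and masked-token reconstruction loss described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `tube_mask` and `masked_mse`, the 8 x 14 x 14 token grid, and the 2 x 16 x 16 RGB cube size are all assumptions made for the example, and a zero array stands in for the decoder output.

```python
import numpy as np

def tube_mask(num_temporal, num_spatial, mask_ratio, rng):
    """Return a boolean mask of shape (num_temporal, num_spatial); True = masked.

    One spatial mask is sampled and repeated along the time axis, so a
    spatial position is either masked in every temporal slice or in none.
    """
    num_masked = int(round(mask_ratio * num_spatial))
    spatial = np.zeros(num_spatial, dtype=bool)
    spatial[rng.choice(num_spatial, size=num_masked, replace=False)] = True
    return np.tile(spatial, (num_temporal, 1))

def masked_mse(pred, target, mask):
    """Mean squared error computed over the masked tokens only.

    pred, target: arrays of shape (num_tokens, token_dim)
    mask: boolean array of shape (num_tokens,), True = masked
    """
    diff = pred[mask] - target[mask]
    return float(np.mean(diff ** 2))

# Illustrative sizes (assumed): 8 temporal slices, 14 x 14 spatial grid.
rng = np.random.default_rng(0)
mask = tube_mask(num_temporal=8, num_spatial=14 * 14, mask_ratio=0.9, rng=rng)
flat_mask = mask.reshape(-1)

token_dim = 2 * 16 * 16 * 3                  # pixels in one assumed RGB cube
target = rng.standard_normal((flat_mask.size, token_dim))  # fake pixel targets
pred = np.zeros_like(target)                 # stand-in for the decoder output
loss = masked_mse(pred, target, flat_mask)
print(mask.shape, int(mask[0].sum()), loss)
```

Because every row of `mask` is identical, a masked spatial location cannot be recovered by copying the same location from a neighboring slice, which is the leakage that tube masking is meant to prevent.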

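Experimental Results

Research Questions

RQ1: Can VideoMAE learn useful spatiotemporal representations through self-supervised pre-training on relatively small datasets?
RQ2: Under tube masking, does an extremely high masking ratio (90% to 95%) improve data efficiency and performance compared with other strategies?
RQ3: How do pre-training data quality, domain shift, and backbone choice affect transfer to downstream video tasks?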

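Key Findings

VideoMAE achieves strong results without any extra data, including on small datasets (e.g., 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51, with a vanilla ViT backbone).
An extremely high masking ratio (90% to 95%) benefits masked video modeling because video content is temporally redundant.
Tube masking helps prevent information leakage caused by temporal correlation and encourages learning of high-level spatiotemporal structure.
VideoMAE trained on only 3.5k videos remains effective, highlighting its data efficiency for SSVP.
Pre-training on the target video data without external data outperforms MoCo v3 and training from scratch on several benchmarks; domain shift affects transfer.
When transferred to AVA, VideoMAE with a ViT-B backbone pre-trained on Kinetics-400 reaches 26.7 mAP, and mAP rises further with larger backbones and more data.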
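
To make the effect of the masking ratio concrete, the sketch below counts how many tokens the encoder would process at different ratios. The input size (16 frames at 224 x 224), the 2 x 16 x 16 cube size, and the helper name `token_budget` are assumptions chosen for illustration and are not stated in this summary; the quadratic attention-cost estimate is a rough rule of thumb rather than a measured speedup.

```python
def token_budget(frames, height, width, cube=(2, 16, 16), mask_ratio=0.9):
    """Return (total_tokens, visible_tokens) for one clip under tube masking."""
    cube_t, cube_h, cube_w = cube
    temporal = frames // cube_t                       # temporal slices
    spatial = (height // cube_h) * (width // cube_w)  # tokens per slice
    masked_per_slice = int(round(mask_ratio * spatial))
    visible = temporal * (spatial - masked_per_slice)
    return temporal * spatial, visible

for ratio in (0.75, 0.90, 0.95):
    total, visible = token_budget(16, 224, 224, mask_ratio=ratio)
    # Self-attention cost grows roughly with the square of the token count.
    relative_attention = (visible / total) ** 2
    print(f"ratio={ratio:.2f}  total={total}  visible={visible}  "
          f"attention cost vs. unmasked={relative_attention:.4f}")
```

Under these assumptions a clip yields 1568 tokens, of which only 160 remain visible at a 90% ratio, so the encoder works on roughly a tenth of the sequence.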

This review was generated by AI and reviewed by human editors.