QUICK REVIEW

[Paper Review] VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling

Tsu-Jui Fu, Linjie Li|arXiv (Cornell University)|Nov 24, 2021

Multimodal Machine Learning Applications | References: 84 | Cited by: 89

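One-Sentence Summary

VIOLET introduces a fully end-to-end VidL model built on the Video Swin Transformer, together with a novel Masked Visual-token Modeling pre-training task. It achieves state-of-the-art results on multiple text-to-video retrieval and video question answering benchmarks while explicitly modeling temporal dynamics.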

ABSTRACT

A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data. Recent studies try to mitigate this disconnection via end-to-end training. To make it computationally feasible, prior works tend to "imagify" video inputs, i.e., a handful of sparsely sampled frames are fed into a 2D CNN, followed by a simple mean-pooling or concatenation to obtain the overall video representations. Although achieving promising results, such simple approaches may lose temporal information that is essential for performing downstream VidL tasks. In this work, we present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs. Further, unlike previous studies that found pre-training tasks on video inputs (e.g., masked frame modeling) not very effective, we design a new pre-training task, Masked Visual-token Modeling (MVM), for better video modeling. Specifically, the original video frame patches are "tokenized" into discrete visual tokens, and the goal is to recover the original visual tokens based on the masked patches. Comprehensive analysis demonstrates the effectiveness of both explicit temporal modeling via video transformer and MVM. As a result, VIOLET achieves new state-of-the-art performance on 5 video question answering tasks and 4 text-to-video retrieval tasks.
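The abstract's concern that mean-pooling "may lose temporal information" can be illustrated with a toy example. The sketch below is not taken from the paper: the frame features are made-up 2-D vectors and the helper names `mean_pool` and `concat` are hypothetical. It only shows that averaging frame features is invariant to frame order, so a clip and its reversed version become indistinguishable after pooling.

```python
# Toy illustration only; feature values and function names are assumptions.

def mean_pool(frame_features):
    """Average per-frame feature vectors into a single video vector."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


def concat(frame_features):
    """Concatenate per-frame feature vectors in temporal order."""
    return [x for f in frame_features for x in f]


# Hypothetical features for a 3-frame clip (e.g. an object moving left to right).
forward = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
backward = list(reversed(forward))  # same frames, played in reverse

pooled_forward = mean_pool(forward)
pooled_backward = mean_pool(backward)

# Mean-pooling discards frame order; concatenation keeps it but does not
# itself model how frames relate over time.
same_after_pooling = pooled_forward == pooled_backward
same_after_concat = concat(forward) == concat(backward)
```

Concatenation keeps the order but leaves all temporal reasoning to later layers, which is one reading of why the authors instead use a video transformer that attends across space and time directly.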

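Motivation and Objectives

Advance end-to-end VidL modeling to overcome the disconnection caused by fixed video representations.
Explicitly model temporal dynamics with a Video Swin Transformer.
Introduce Masked Visual-token Modeling (MVM), which trains the model to predict discrete visual tokens for video patches.
Show that combining a video transformer (VT) with cross-modal learning and MVM improves downstream VidL tasks.
Demonstrate state-of-the-art results on multiple video QA and retrieval benchmarks.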

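Proposed Method

Encode sparsely sampled video frames with a Video Swin Transformer, which models them explicitly in space and time.
Process text inputs with a Language Embedder and fuse the video and text modalities with a Cross-modal Transformer.
Pre-train with three tasks: Masked Language Modeling (MLM), Visual-Text Matching (VTM), and Masked Visual-token Modeling (MVM).
MVM tokenizes frame patches into discrete visual tokens with a discrete VAE (dVAE) and trains the model to recover the original visual tokens from the masked patches.
Apply Blockwise Masking and Attended Masking, which strengthen the MLM and MVM signals by focusing on salient tokens and patches.
Train end-to-end on image-text and video-text data; the pre-training corpus comprises the YT-Temporal, WebVid, and ConceptualCaptions datasets.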
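The MVM objective and blockwise masking described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `blockwise_mask` and `mvm_loss` are hypothetical helpers, the grid and vocabulary sizes are arbitrary, the block shape is chosen by hand, and random integers stand in for the token ids that a dVAE tokenizer would produce.

```python
import math
import random


def blockwise_mask(t, h, w, block, rng):
    """Mask one random spatio-temporal block in a t x h x w patch grid.

    The (bt, bh, bw) block shape is an assumption made for illustration.
    """
    bt, bh, bw = block
    t0 = rng.randrange(t - bt + 1)
    i0 = rng.randrange(h - bh + 1)
    j0 = rng.randrange(w - bw + 1)
    return [[[t0 <= k < t0 + bt and i0 <= i < i0 + bh and j0 <= j < j0 + bw
              for j in range(w)]
             for i in range(h)]
            for k in range(t)]


def mvm_loss(logits, target_tokens, mask):
    """Mean cross-entropy over masked patches only.

    logits: per-patch scores over the visual-token vocabulary.
    target_tokens: per-patch discrete token id (stand-in for dVAE output).
    mask: per-patch bool, True where the input patch was masked.
    """
    total, count = 0.0, 0
    for scores, target, masked in zip(logits, target_tokens, mask):
        if not masked:
            continue  # unmasked patches do not contribute to the loss
        m = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
        count += 1
    return total / count if count else 0.0


rng = random.Random(0)
T, H, W, VOCAB = 2, 4, 4, 8  # toy sizes, not the paper's settings
mask3d = blockwise_mask(T, H, W, (1, 2, 2), rng)
flat_mask = [m for frame in mask3d for row in frame for m in row]
targets = [rng.randrange(VOCAB) for _ in flat_mask]  # stand-in for dVAE ids

# A model with no preference predicts uniformly over the vocabulary.
uniform_logits = [[0.0] * VOCAB for _ in flat_mask]
loss_uniform = mvm_loss(uniform_logits, targets, flat_mask)

# A model that scores the correct token much higher drives the loss toward 0.
confident_logits = [[20.0 if v == tgt else 0.0 for v in range(VOCAB)]
                    for tgt in targets]
loss_confident = mvm_loss(confident_logits, targets, flat_mask)
```

Because the targets are discrete token ids, the objective becomes a classification problem over a fixed vocabulary instead of a regression onto pixel or feature values. Attended Masking would replace the random block with positions selected by attention weight; it is omitted here to keep the sketch short.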

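Experimental Results

Research Questions

RQ1: Does explicit temporal modeling with a video transformer outperform simple mean-pooling or concatenation of frame features on VidL tasks?
RQ2: Does Masked Visual-token Modeling (MVM) provide a measurable gain over earlier visual masking strategies (MRM/MFM) in video-language pre-training?
RQ3: How does joint pre-training on image-text and video-text data affect text-to-video retrieval and video QA performance?
RQ4: How do different pre-training datasets (WebVid, CC, YT-Temporal) affect downstream VidL tasks?
RQ5: Can end-to-end VidL training with MVM reach state-of-the-art results across multiple benchmarks?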

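Key Findings

VIOLET sets new state-of-the-art results on multiple text-to-video retrieval benchmarks and video QA datasets.
Explicit temporal modeling with the Video Swin Transformer yields consistent gains over the mean-pooling and frame-concatenation baselines.
Masked Visual-token Modeling (MVM) significantly improves downstream VidL performance on retrieval and question answering compared with MRM/MFM or MLM-based visual masking.
Pre-training on WebVid+CC and YT-Temporal data yields strong gains, with WebVid+CC providing a robust cross-modal learning signal.
Even with relatively modest compute and frame resolution, end-to-end training with MVM is beneficial and delivers performance competitive with larger-scale methods.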


This review was generated by AI and reviewed by a human editor.