QUICK REVIEW

[Paper Digest] VDT: General-purpose Video Diffusion Transformers via Mask Modeling

Haoyu Lu, Guoxing Yang|arXiv (Cornell University)|May 22, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 9

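One-sentence summary

Introduces the Video Diffusion Transformer (VDT), a pure-transformer, diffusion-based video generation method that uses temporal and spatial attention together with a unified spatial-temporal mask modeling mechanism to handle unconditional generation, prediction, interpolation, animation, and completion.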

ABSTRACT

This work introduces Video Diffusion Transformer (VDT), which pioneers the use of transformers in diffusion-based video generation. It features transformer blocks with modularized temporal and spatial attention modules to leverage the rich spatial-temporal representation inherited in transformers. We also propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the model, to cater to diverse video generation scenarios. VDT offers several appealing benefits. 1) It excels at capturing temporal dependencies to produce temporally consistent video frames and even simulate the physics and dynamics of 3D objects over time. 2) It facilitates flexible conditioning information, e.g., simple concatenation in the token space, effectively unifying different token lengths and modalities. 3) Pairing with our proposed spatial-temporal mask modeling mechanism, it becomes a general-purpose video diffuser for harnessing a range of tasks, including unconditional generation, video prediction, interpolation, animation, and completion, etc. Extensive experiments on these tasks spanning various scenarios, including autonomous driving, natural weather, human action, and physics-based simulation, demonstrate the effectiveness of VDT. Additionally, we present comprehensive studies on how VDT handles conditioning information with the mask modeling mechanism, which we believe will benefit future research and advance the field. Project page: https://VDT-2023.github.io

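Research motivation and objectives

Pioneer the use of a transformer as the backbone for diffusion-based video generation.
Develop a unified spatial-temporal mask modeling mechanism that supports diverse video tasks.
Enable flexible conditioning in video generation and efficient handling of different input lengths and modalities.
Demonstrate strong performance on unconditional generation, video prediction, interpolation, and completion across multiple datasets.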

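Proposed method

Uses a pure transformer architecture, with temporal and spatial attention inside each VDT block.
Projects videos into a latent space with a pre-trained VAE tokenizer to reduce computation.
Introduces temporal and spatial positional embeddings to learn spatial-temporal information.
Applies adaptive group normalization to inject diffusion timestep information into the transformer blocks.
Explores conditioning schemes for video prediction, including adaptive layer normalization, cross-attention, and token concatenation.
Introduces a unified spatial-temporal mask modeling mechanism that mixes conditioning frames with noise to serve multiple tasks.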
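
The mask modeling idea in the last item can be illustrated with a small sketch. This is not the authors' code: it assumes a binary mask M in which 1 marks positions supplied as conditioning and 0 marks positions to be generated, and it blends the noisy latent F with the conditioning latent C as I = F * (1 - M) + C * M. The function names `task_mask` and `blend`, the task names, and the per-frame mask granularity are all hypothetical choices made for illustration.

```python
import random

def task_mask(task, num_frames, cond=1):
    """Return a per-frame mask: 1 = frame is given as condition, 0 = frame is generated.

    Task names and mask layouts are illustrative assumptions.
    """
    if task == "unconditional":
        return [0] * num_frames
    if task == "prediction":           # first `cond` frames are known
        return [1 if t < cond else 0 for t in range(num_frames)]
    if task == "backward_prediction":  # last `cond` frames are known
        return [1 if t >= num_frames - cond else 0 for t in range(num_frames)]
    if task == "interpolation":        # first and last frames are known
        return [1 if t in (0, num_frames - 1) else 0 for t in range(num_frames)]
    if task == "image_to_video":       # a single known starting image
        return [1] + [0] * (num_frames - 1)
    raise ValueError(f"unknown task: {task}")

def blend(noisy, condition, mask):
    """Compute I = F * (1 - M) + C * M frame by frame.

    noisy, condition: lists of frames, each frame a list of floats (latent values).
    mask: list of 0/1 values, one per frame.
    """
    return [
        [f * (1 - m) + c * m for f, c in zip(f_frame, c_frame)]
        for f_frame, c_frame, m in zip(noisy, condition, mask)
    ]

# Toy example: 4 frames, 3 latent values per frame.
rng = random.Random(0)
clean = [[float(t)] * 3 for t in range(4)]  # stand-in for conditioning latents C
noisy = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(4)]  # stand-in for noisy latents F

mask = task_mask("interpolation", num_frames=4)
model_input = blend(noisy, clean, mask)
```

Under these assumptions, switching tasks only changes the mask, not the model. A finer mask with one value per latent position rather than per frame would extend the same blend to spatial-temporal completion.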

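Experimental results

Research questions

RQ1: Can a transformer-based diffusion model capture temporal dependencies well enough to generate high-quality, temporally consistent video?
RQ2: How can conditioning information be introduced in a unified and effective way across diverse video generation tasks (unconditional generation, prediction, interpolation, animation, completion)?
RQ3: Can a unified spatial-temporal mask modeling approach let a single model perform multiple video generation tasks without changing the architecture?
RQ4: Which conditioning strategy gives the fastest convergence and the best sample quality in video prediction?
RQ5: How does VDT compare with state-of-the-art diffusion and other generative methods on standard video benchmarks?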

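Key findings

On unconditional video generation (e.g., UCF-101), VDT performs competitively with, and in some cases better than, state-of-the-art methods, and it outperforms several GAN and diffusion baselines.
In video prediction, token concatenation gives the fastest convergence and the best sample quality (FVD/SSIM) among the conditioning strategies evaluated.
Unified spatial-temporal mask modeling lets VDT handle diverse tasks within one framework (unconditional generation, bidirectional prediction, arbitrary interpolation, image-to-video generation, and spatial-temporal completion).
VDT shows strong temporal modeling, matching or exceeding baselines on video prediction for the Cityscapes and Physion datasets while maintaining color consistency and motion coherence.
On Physion, VDT exceeds scene-centric methods in VQA accuracy for physical collision prediction (65.3% versus a best of 63.1%), indicating robust physical video prediction.
Image pretraining followed by joint training improves efficiency and performance compared with end-to-end spatial-temporal training.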
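
The token concatenation strategy mentioned in the findings can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: conditioning tokens and noisy tokens are joined along the sequence dimension, the joined sequence is passed through the model, and only the outputs at the noisy positions are kept. The names `predict_with_token_concat` and `toy_model` are hypothetical, and `toy_model` is a placeholder for a transformer that merely lets every token see the sequence mean.

```python
from typing import Callable, List

Token = List[float]  # one latent token, represented here as a toy embedding

def predict_with_token_concat(
    cond_tokens: List[Token],
    noisy_tokens: List[Token],
    model: Callable[[List[Token]], List[Token]],
) -> List[Token]:
    """Concatenate condition and noisy tokens, run the model, keep the noisy part."""
    sequence = cond_tokens + noisy_tokens  # concatenation in token space
    output = model(sequence)               # model sees both parts jointly
    return output[len(cond_tokens):]       # discard outputs at condition positions

def toy_model(sequence: List[Token]) -> List[Token]:
    """Placeholder for a transformer: adds the sequence mean to every token."""
    dim = len(sequence[0])
    mean = [sum(tok[d] for tok in sequence) / len(sequence) for d in range(dim)]
    return [[tok[d] + mean[d] for d in range(dim)] for tok in sequence]

# The same call works for any number of conditioning tokens.
noisy = [[0.0, 0.0] for _ in range(3)]
pred_one = predict_with_token_concat([[1.0, 1.0]], noisy, toy_model)
pred_two = predict_with_token_concat([[1.0, 1.0], [1.0, 1.0]], noisy, toy_model)
```

Because the conditioning enters only as extra sequence elements, the number of conditioning tokens can vary from call to call without touching the model, which is consistent with the abstract's remark about unifying different token lengths.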

This digest was generated by AI and reviewed by a human editor.