QUICK REVIEW

[Paper Review] WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens

Xiaofeng Wang, Zheng Zhu|arXiv (Cornell University)|Jan 18, 2024

Generative Adversarial Networks and Image Synthesis|Cited by 5

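One-Sentence Summary

WorldDreamer trains a general world model for video generation by predicting masked visual tokens within a Transformer framework, enabling text-to-video, image-to-video, video editing, and action-conditioned video generation across diverse scenarios.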

ABSTRACT

World models play a crucial role in understanding and predicting the dynamics of the world, which is essential for video generation. However, existing world models are confined to specific scenarios such as gaming or driving, limiting their ability to capture the complexity of general world dynamic environments. Therefore, we introduce WorldDreamer, a pioneering world model to foster a comprehensive comprehension of general world physics and motions, which significantly enhances the capabilities of video generation. Drawing inspiration from the success of large language models, WorldDreamer frames world modeling as an unsupervised visual sequence modeling challenge. This is achieved by mapping visual inputs to discrete tokens and predicting the masked ones. During this process, we incorporate multi-modal prompts to facilitate interaction within the world model. Our experiments show that WorldDreamer excels in generating videos across different scenarios, including natural scenes and driving environments. WorldDreamer showcases versatility in executing tasks such as text-to-video conversion, image-to-video synthesis, and video editing. These results underscore WorldDreamer's effectiveness in capturing dynamic elements within diverse general world environments.

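Motivation and Objectives

Motivate the need for a general world model that handles diverse real-world dynamics beyond gaming and robotics.
Propose a token-prediction paradigm for video modeling, inspired by large language models.
Develop the Spatial Temporal Patchwise Transformer (STPT) to learn motion and physics in videos efficiently.
Allow multimodal prompts (text and actions) to guide video generation and editing.
Demonstrate versatility and applicability across natural scenes, driving scenes, and a range of generation and editing tasks.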

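Proposed Method

Encode visual data into discrete tokens with VQGAN and predict the masked tokens.
Represent text with T5 semantic embeddings and actions with an MLP to form multimodal prompts.
Use the Spatial Temporal Patchwise Transformer (STPT) to compute attention within local spatiotemporal patches, and apply cross-modal attention over the multimodal prompts.
Train with a dynamic masking strategy on a cosine schedule, which enables parallel token prediction and reduces information leakage.
Optimize with a cross-entropy loss to predict the masked tokens conditioned on the unmasked tokens and the multimodal prompts.
Fine-tune all STPT parameters on self-collected data and nuScenes data to improve spatiotemporal understanding.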
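
The masking and loss steps above can be illustrated with a small NumPy sketch. This summary does not give the exact schedule, so the code assumes the common form ratio = cos(pi * r / 2) with r drawn uniformly from [0, 1]. `MASK_ID`, the function names, the toy token grid, and the random logits standing in for a model are all hypothetical and are not taken from the paper.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel id for a masked token


def cosine_mask_ratio(r):
    """Map r in [0, 1] to a mask ratio: r=0 masks everything, r=1 masks nothing."""
    return float(np.cos(0.5 * np.pi * r))


def mask_tokens(tokens, ratio, rng):
    """Replace a random subset of tokens with MASK_ID (at least one position)."""
    flat = tokens.reshape(-1)
    n_mask = max(1, int(round(ratio * flat.size)))
    idx = rng.choice(flat.size, size=n_mask, replace=False)
    mask = np.zeros(flat.size, dtype=bool)
    mask[idx] = True
    masked = np.where(mask, MASK_ID, flat)
    return masked.reshape(tokens.shape), mask.reshape(tokens.shape)


def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over masked positions only.

    logits: (..., vocab), targets: (...) int, mask: (...) bool.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    return float(-(picked * mask).sum() / mask.sum())


# Toy usage: a (frames, height, width) grid of token ids from a 16-entry codebook.
rng = np.random.default_rng(0)
tokens = rng.integers(0, 16, size=(2, 4, 4))
ratio = cosine_mask_ratio(rng.uniform())        # a fresh ratio per training sample
masked, mask = mask_tokens(tokens, ratio, rng)
logits = rng.normal(size=tokens.shape + (16,))  # stand-in for model output
loss = masked_cross_entropy(logits, tokens, mask)
```

Because the ratio is resampled for every training sample, the model sees everything from nearly clean to nearly fully masked inputs, which is what lets it fill many tokens per step at inference.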

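Experimental Results

Research Questions

RQ1: Can a general world model learned from visual tokens predict dynamics and physics across different real-world scenarios?
RQ2: How effectively does STPT capture spatiotemporal dynamics while integrating multimodal prompts (text and actions)?
RQ3: Can the model support a range of generation and editing tasks, including text-to-video, image-to-video, inpainting, stylization, and action-to-video?
RQ4: Does parallel masked-token prediction outperform diffusion and autoregressive methods in speed and quality?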

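Key Findings

WorldDreamer generates videos in both natural scenes and driving scenes.
The model supports text-to-video, image-to-video, video editing, and action-to-video generation.
Joint training on image and video data with multimodal prompts improves spatiotemporal understanding.
At inference, parallel masked-token prediction decodes roughly 3x faster than diffusion-based methods.
Classifier-free guidance (CFG) at inference can improve generation quality.
On a single A800 GPU, a 24-frame video at 192x320 resolution is generated in 3 seconds.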
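
To make the inference findings concrete, here is a minimal sketch of parallel masked-token decoding combined with classifier-free guidance. It is an illustration under assumptions, not the paper's implementation: the confidence-based commit rule, the cosine unmasking schedule, the guidance formula `uncond + scale * (cond - uncond)`, and the names `parallel_decode`, `cfg_logits`, `toy_model`, and `MASK_ID` are all hypothetical choices in the style of common masked-token decoders.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel for "not yet decoded"


def cfg_logits(cond, uncond, scale):
    """Classifier-free guidance: move conditional logits away from unconditional ones."""
    return uncond + scale * (cond - uncond)


def parallel_decode(predict_logits, n_tokens, n_steps=4, scale=2.0):
    """Fill n_tokens positions in at most n_steps passes, many tokens per pass.

    predict_logits(tokens, use_prompt) -> (n_tokens, vocab) is a hypothetical model call.
    """
    tokens = np.full(n_tokens, MASK_ID, dtype=np.int64)
    for step in range(1, n_steps + 1):
        masked = tokens == MASK_ID
        if not masked.any():
            break
        logits = cfg_logits(predict_logits(tokens, True),
                            predict_logits(tokens, False), scale)
        shifted = logits - logits.max(axis=-1, keepdims=True)
        probs = np.exp(shifted)
        probs /= probs.sum(axis=-1, keepdims=True)
        guess = probs.argmax(axis=-1)
        # Only still-masked positions compete; decoded ones get -inf confidence.
        conf = np.where(masked, probs.max(axis=-1), -np.inf)
        # Cosine schedule: number of positions that stay masked after this step.
        keep_masked = int(np.floor(np.cos(0.5 * np.pi * step / n_steps) * n_tokens))
        n_commit = max(1, int(masked.sum()) - keep_masked)
        commit = np.argsort(-conf)[:n_commit]
        tokens[commit] = guess[commit]
    return tokens


def toy_model(tokens, use_prompt):
    """Stand-in model: with the prompt it prefers token i % 8 at position i."""
    logits = np.zeros((tokens.size, 8))
    if use_prompt:
        logits[np.arange(tokens.size), np.arange(tokens.size) % 8] = 3.0
    return logits


video_tokens = parallel_decode(toy_model, n_tokens=12)
```

In this sketch the 12 tokens are decoded in 4 passes (two model calls per pass because of guidance), whereas a purely autoregressive decoder would need one pass per token. The reported 3x speedup over diffusion-based methods comes from the source and is not something this toy example measures.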

This review was generated by AI and reviewed by human editors.