QUICK REVIEW

[Paper Review] Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

Levon Khachatryan, Andranik Movsisyan|arXiv (Cornell University)|Mar 23, 2023

Generative Adversarial Networks and Image Synthesis|Cited by 7

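One-Sentence Summary

Without any training, Text2Video-Zero generates temporally consistent videos from text prompts by modifying a pretrained text-to-image diffusion model to introduce motion dynamics and cross-frame attention.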

ABSTRACT

Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain. Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background time consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object. Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data. Our code will be open sourced at: https://github.com/Picsart-AI-Research/Text2Video-Zero .

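Motivation and Objectives

Introduce zero-shot text-to-video generation as a task that requires no training.
Use a pretrained text-to-image diffusion model to synthesize video sequences.
Achieve temporal consistency through motion dynamics in the latent codes and cross-frame attention.
Demonstrate broad applicability to conditional and content-specialized video generation and to video editing.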

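Proposed Method

Enrich the latent codes of the frames with motion dynamics so that the global scene and background stay consistent over time.
Apply cross-frame attention so that every frame attends to the first frame, preserving the identity of the foreground object.
Warp the latent representation from frame to frame with a motion field, then apply forward diffusion again to leave the model freedom in the resulting motion.
Replace the self-attention layers in Stable Diffusion with cross-frame attention to keep the frames consistent with one another.
Optionally smooth the background: guided by a foreground mask, take a convex combination of each frame's latent code and the warped first-frame latent code in the background region.
Demonstrate compatibility with ControlNet and DreamBooth models for conditional and specialized generation, and with Video Instruct-Pix2Pix for instruction-guided editing.
Generate the video sequence by running DDIM sampling on the modified latents.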
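The three latent-level steps listed above (motion dynamics, background smoothing, cross-frame attention) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the tensor layouts, the wrap-around shift via np.roll as a stand-in for warping with a motion field, and the default alpha value are all hypothetical choices made for the example. Re-noising by forward diffusion and the DDIM sampling loop are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def translate_latents(first_latent, num_frames, direction=(1, 1), step=1):
    """Build per-frame latents by shifting the first frame's latent.

    first_latent: array of shape (channels, height, width).
    Frame k is shifted by k * step * direction latent pixels.
    Assumption: np.roll (wrap-around shift) stands in for a real
    motion-field warp; it is used only to keep the sketch short.
    """
    dy, dx = direction
    frames = [
        np.roll(first_latent, shift=(k * step * dy, k * step * dx), axis=(-2, -1))
        for k in range(num_frames)
    ]
    return np.stack(frames)  # (frames, channels, height, width)


def smooth_background(latents, warped_first, fg_mask, alpha=0.6):
    """Blend background pixels toward the warped first-frame latent.

    fg_mask is 1 on the foreground and 0 on the background and must
    broadcast against latents. Foreground pixels are left untouched;
    background pixels become a convex combination of the two latents.
    alpha is a hypothetical mixing weight chosen for this example.
    """
    blended = alpha * warped_first + (1.0 - alpha) * latents
    return fg_mask * latents + (1.0 - fg_mask) * blended


def cross_frame_attention(q, k, v):
    """Attention in which every frame attends to the first frame.

    q, k, v: arrays of shape (frames, tokens, dim). Each frame keeps
    its own queries, but keys and values are taken from frame 0.
    """
    k_first = np.broadcast_to(k[:1], k.shape)
    v_first = np.broadcast_to(v[:1], v.shape)
    scores = q @ k_first.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v_first
```

In this sketch, the output for frame 0 is identical to ordinary self-attention on frame 0, while every later frame draws its appearance from frame 0's keys and values, which is the property the summary credits with preserving foreground identity.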

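Experimental Results

Research Questions

RQ1: Can zero-shot text-to-video generation be achieved without training or fine-tuning on video data?
RQ2: Do motion dynamics in the latent codes and cross-frame attention improve temporal consistency and foreground identity preservation in the generated videos?
RQ3: Can zero-shot video generation be extended to conditional, specialized, and instruction-guided editing scenarios without additional training?
RQ4: How does the proposed method compare with existing text-to-video methods in prompt alignment and temporal stability?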

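Key Findings

The method generates temporally consistent videos from text prompts without any training.
Motion dynamics in the latent codes improve the temporal consistency of the global scene and background.
Cross-frame attention preserves foreground appearance and identity well across frames.
On CLIP-based text-video alignment, the method is competitive with CogVideo (31.19 vs. 29.63).
It supports conditional and specialized video generation, as well as Video Instruct-Pix2Pix, without any retraining.
Qualitative results show strong text-video alignment and temporal consistency across a variety of prompts and guidance signals.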
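For readers unfamiliar with CLIP-based alignment scores, the sketch below shows one common way such a number is formed: the cosine similarity between a text embedding and each frame embedding, averaged over frames and scaled. It assumes the embeddings have already been produced by a CLIP model; the function name and the scale factor of 100 are assumptions for illustration, and the paper's exact evaluation protocol may differ.

```python
import numpy as np


def clip_style_score(text_emb, frame_embs, scale=100.0):
    """Mean scaled cosine similarity between one text and many frames.

    text_emb: array of shape (dim,).
    frame_embs: array of shape (frames, dim).
    Both are assumed to be precomputed CLIP embeddings.
    """
    t = text_emb / np.linalg.norm(text_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return float(scale * np.mean(f @ t))
```

A higher value means the frames sit closer to the prompt in the embedding space; the metric says nothing about temporal consistency, which the summary assesses separately.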

This review was generated by AI and checked by a human editor.