QUICK REVIEW

[Paper Review] ControlVideo: Training-free Controllable Text-to-Video Generation

Yabo Zhang, Yuxiang Wei|arXiv (Cornell University)|May 22, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 34

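One-sentence summary

ControlVideo provides a training-free framework for controllable text-to-video generation that combines fully cross-frame attention, an interleaved-frame smoother, and a hierarchical sampler to produce high-quality, temporally coherent videos on a commodity GPU.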

ABSTRACT

Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterpart still lags behind due to the excessive training cost of temporal modeling. Besides the training burden, the generated videos also suffer from appearance inconsistency and structural flickers, especially in long video synthesis. To address these challenges, we design a training-free framework called ControlVideo to enable natural and efficient text-to-video generation. ControlVideo, adapted from ControlNet, leverages coarsely structural consistency from input motion sequences, and introduces three modules to improve video generation. Firstly, to ensure appearance coherence between frames, ControlVideo adds fully cross-frame interaction in self-attention modules. Secondly, to mitigate the flicker effect, it introduces an interleaved-frame smoother that employs frame interpolation on alternated frames. Finally, to produce long videos efficiently, it utilizes a hierarchical sampler that separately synthesizes each short clip with holistic coherency. Empowered with these modules, ControlVideo outperforms the state-of-the-arts on extensive motion-prompt pairs quantitatively and qualitatively. Notably, thanks to the efficient designs, it generates both short and long videos within several minutes using one NVIDIA 2080Ti. Code is available at https://github.com/YBYBZhang/ControlVideo.

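Motivation and objectives

Enable efficient video generation without the costly training of temporal models.
Achieve appearance-consistent videos by leveraging a pretrained text-to-image model.
Reduce structural flicker with smoothing based on frame interpolation.
Support long videos with a memory-efficient hierarchical sampling strategy.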

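Proposed method

Adapt ControlNet to video by inflating the U-Net along the temporal axis while keeping the ControlNet auxiliary branch.
Introduce fully cross-frame interaction in self-attention by concatenating all frames, so that attention spans the whole video rather than a single frame.
Add an interleaved-frame smoother that de-flickers at selected timesteps by interpolating the middle frame of each three-frame clip.
Implement a hierarchical sampler that splits a long video into short clips and pre-generates key frames for long-range coherence.
Sample with DDIM using 50 timesteps, and smooth with lightweight frame interpolation (RIFE).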
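
The fully cross-frame interaction listed above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the single-head attention, the tensor shapes, and the random projection weights are all assumptions made for illustration, whereas the real model applies multi-head attention inside an inflated U-Net. The sketch only shows the core idea, namely that tokens from all frames are concatenated before attention, so each token can attend to every frame instead of only its own.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    # q, k, v: (batch, tokens, channels); single head for simplicity.
    scale = 1.0 / np.sqrt(q.shape[-1])
    weights = softmax(q @ k.transpose(0, 2, 1) * scale)
    return weights @ v, weights

def per_frame_attention(x, wq, wk, wv):
    # x: (frames, tokens, channels). Each frame is a separate batch item,
    # so tokens only see tokens of the same frame.
    return self_attention(x @ wq, x @ wk, x @ wv)

def fully_cross_frame_attention(x, wq, wk, wv):
    # Concatenate the tokens of all frames into one sequence, attend over
    # the whole sequence, then split the result back into frames.
    f, n, c = x.shape
    flat = x.reshape(1, f * n, c)
    out, weights = self_attention(flat @ wq, flat @ wk, flat @ wv)
    return out.reshape(f, n, c), weights

# Hypothetical toy sizes: 4 frames, 6 tokens per frame, 8 channels.
rng = np.random.default_rng(0)
frames, tokens, channels = 4, 6, 8
x = rng.normal(size=(frames, tokens, channels))
wq, wk, wv = (rng.normal(size=(channels, channels)) for _ in range(3))

per_out, per_w = per_frame_attention(x, wq, wk, wv)
full_out, full_w = fully_cross_frame_attention(x, wq, wk, wv)
print(per_w.shape)   # (4, 6, 6): each frame attends within itself
print(full_w.shape)  # (1, 24, 24): every token attends across all frames
```

The attention map grows from 4 maps of 6 x 6 to one map of 24 x 24, which is why the cost of this design rises quadratically with the number of frames and why long videos need a separate sampling strategy.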

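Experimental results

Research questions

RQ1 Can a training-free adaptation of a pretrained text-to-image model to video produce high-quality, temporally coherent videos conditioned on text and motion sequences?
RQ2 Does fully cross-frame attention improve appearance coherence compared with first-frame-only or sparse cross-frame mechanisms?
RQ3 Does the interleaved-frame smoother reduce structural flicker without sacrificing frame independence?
RQ4 Can long videos be generated efficiently on a commodity GPU with the hierarchical sampler?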

Key findings

Method	Structure condition	Frame consistency (%)	Prompt consistency (%)
Tune-A-Video	DDIM Inversion	94.53	31.57
Text2Video-Zero	Canny Edge	95.17	30.74
ControlVideo	Canny Edge	96.83	30.75
Text2Video-Zero	Depth Map	95.99	31.69
ControlVideo	Depth Map	97.22	31.81

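On motion-prompt pairs, ControlVideo achieves higher frame consistency than the baselines, with comparable or slightly higher prompt consistency under the same structure condition.
Depth-conditioned videos score higher than Canny-conditioned videos in both frame consistency and prompt consistency.
Fully cross-frame interaction yields higher frame consistency than other cross-frame mechanisms; adding the smoother improves consistency further.
The hierarchical sampler enables long-video generation with good overall coherence on a standard GPU.
On an RTX 2080Ti, a short video (~15 frames) takes about 2 minutes and a long video (~100 frames) about 10 minutes.
Qualitative results show better appearance consistency and fewer artifacts than Tune-A-Video and Text2Video-Zero.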
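
The smoother and long-video findings above both rest on simple frame-index bookkeeping, which the following sketch tries to make concrete. It is a hedged illustration under stated assumptions, not the released code: `plan_hierarchical_clips`, `interleaved_smooth`, the `clip_len` and `offset` parameters, and the choice of 12 frames per clip are hypothetical, frames are represented by scalars instead of images, and plain averaging stands in for the RIFE interpolation model.

```python
def plan_hierarchical_clips(num_frames, clip_len):
    """Split a long video into short clips that share key frames.

    Key frames are spaced clip_len - 1 apart, and the last frame is always
    a key frame. In the method, key frames would be generated first, and
    each clip would then be synthesized between its two key frames.
    """
    if num_frames < 2 or clip_len < 2:
        raise ValueError("need at least 2 frames and clip_len >= 2")
    step = clip_len - 1
    keys = list(range(0, num_frames - 1, step)) + [num_frames - 1]
    clips = [list(range(a, b + 1)) for a, b in zip(keys, keys[1:])]
    return keys, clips

def interleaved_smooth(frames, offset):
    """Replace the middle frame of each three-frame clip by an interpolation
    of its two neighbours.

    offset=0 smooths frames 1, 3, 5, ...; offset=1 smooths frames 2, 4, 6, ...
    Alternating the offset across two steps touches every interior frame.
    Averaging is an assumption standing in for a learned interpolator.
    """
    out = list(frames)
    for mid in range(1 + offset, len(frames) - 1, 2):
        out[mid] = 0.5 * (frames[mid - 1] + frames[mid + 1])
    return out

# A long video of 100 frames, split into clips of 12 frames each.
keys, clips = plan_hierarchical_clips(num_frames=100, clip_len=12)
print(len(keys), len(clips))  # 10 9

# Scalar "frames" with flicker on frames 1 and 3.
flickering = [0.0, 10.0, 2.0, 10.0, 4.0]
print(interleaved_smooth(flickering, offset=0))  # [0.0, 1.0, 2.0, 3.0, 4.0]
```

Because adjacent clips share a key frame, each clip can be sampled on its own with bounded memory while the pre-generated key frames tie the clips together.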


This review was generated by AI and checked by a human editor.