QUICK REVIEW

[Paper Review] MagicVideo: Efficient Video Generation With Latent Diffusion Models

Daquan Zhou, Weimin Wang|arXiv (Cornell University)|Nov 20, 2022

Generative Adversarial Networks and Image Synthesis | Cited by 63

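One-Sentence Summary

MagicVideo builds a latent diffusion video generator with a lightweight frame-wise adaptor and directed temporal attention, so that it can efficiently generate text-conditioned videos at 256x256 resolution on a single GPU; a VideoVAE and unsupervised pre-training are used to improve quality.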

ABSTRACT

We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo. MagicVideo can generate smooth video clips that are concordant with the given text descriptions. Due to a novel and efficient 3D U-Net design and modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card, which takes around 64x fewer computations than the Video Diffusion Models (VDM) in terms of FLOPs. In specific, unlike existing works that directly train video models in the RGB space, we use a pre-trained VAE to map video clips into a low-dimensional latent space and learn the distribution of videos' latent codes via a diffusion model. Besides, we introduce two new designs to adapt the U-Net denoiser trained on image tasks to video data: a frame-wise lightweight adaptor for the image-to-video distribution adjustment and a directed temporal attention module to capture temporal dependencies across frames. Thus, we can exploit the informative weights of convolution operators from a text-to-image model for accelerating video training. To ameliorate the pixel dithering in the generated videos, we also propose a novel VideoVAE auto-encoder for better RGB reconstruction. We conduct extensive experiments and demonstrate that MagicVideo can generate high-quality video clips with either realistic or imaginary content. Refer to https://magicvideo.github.io/# for more examples.

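Motivation and Goals

Address the data efficiency and computational cost of text-to-video generation.
Model the video distribution in a low-dimensional latent space to reduce computation.
Reuse pre-trained image generation weights to accelerate video training.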
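
To make the low-dimensional latent space concrete, the following sketch compares how many values a diffusion model would have to denoise in RGB space versus latent space. The 16 frames and 256x256 resolution come from this review; the 8x spatial downsampling factor and the 4 latent channels are assumptions chosen for illustration, not values stated here, and the resulting ratio is not a derivation of the FLOPs figure quoted in the abstract.

```python
# Illustrative size comparison only. DOWNSAMPLE and LATENT_CHANNELS are
# assumed values, not figures taken from this review.

def num_elements(frames, channels, height, width):
    """Number of scalar values in a video tensor of the given shape."""
    return frames * channels * height * width

FRAMES = 16          # keyframes, as stated in the review
RES = 256            # spatial resolution, as stated in the review
DOWNSAMPLE = 8       # assumed VAE spatial downsampling factor
LATENT_CHANNELS = 4  # assumed number of latent channels

rgb = num_elements(FRAMES, 3, RES, RES)
latent = num_elements(FRAMES, LATENT_CHANNELS, RES // DOWNSAMPLE, RES // DOWNSAMPLE)
reduction = rgb / latent

print(f"RGB values:    {rgb}")
print(f"Latent values: {latent}")
print(f"Reduction:     {reduction:.0f}x")
```

Under these assumptions the denoiser operates on 48x fewer values per clip. The actual saving in compute depends on the network architecture as well as on the tensor size.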

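Proposed Method

Generate 16 keyframes with latent diffusion in a low-dimensional video latent space.
Introduce a 3D U-Net with a lightweight frame-wise adaptor and a directed temporal attention module to model spatio-temporal features.
Replace 3D/2+1D convolutions with a video distribution adaptor (2D adaptor) to cut computation and reuse image-model priors.
Add a VideoVAE decoder to reduce frame-level dithering artifacts.
Train an interpolation network to synthesize intermediate frames between keyframes for smoother motion.
Apply a diffusion-based super-resolution model to upscale 256x256 frames to higher resolution.
Use CLIP-based frame embeddings for unsupervised pre-training, then fine-tune on text-video pairs.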
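
The frame-wise adaptor can be pictured with a minimal sketch. It assumes the adaptor is a learned per-frame scale and shift applied to features produced by a shared 2D layer; this review does not spell out the adaptor's exact form, so the function name `frame_adaptor` and its parameters are hypothetical.

```python
# Minimal sketch of a frame-wise adaptor, ASSUMING it is a per-frame
# scale and shift on top of features from a shared 2D (image) layer.
# The name and signature are hypothetical, not the authors' code.

def frame_adaptor(features, scales, shifts):
    """Apply one (scale, shift) pair per frame.

    features: list of T frames, each a flat list of floats
              (stand-in for the output of a shared 2D layer).
    scales, shifts: lists of T floats, one pair per frame.
    """
    if not (len(features) == len(scales) == len(shifts)):
        raise ValueError("need one scale and one shift per frame")
    return [
        [s * x + b for x in frame]
        for frame, s, b in zip(features, scales, shifts)
    ]

# With scale 1 and shift 0 the adaptor is the identity, so the
# image-model features pass through unchanged.
frames = [[1.0, 2.0], [3.0, 4.0]]
unchanged = frame_adaptor(frames, [1.0, 1.0], [0.0, 0.0])
adapted = frame_adaptor(frames, [2.0, 0.5], [1.0, -1.0])
print(unchanged)
print(adapted)
```

A design like this adds only two parameters per frame per layer in the sketch, which is one way an adaptor could stay lightweight while the heavy 2D weights are reused from an image model.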

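Experimental Results

Research Questions

RQ1: Can latent diffusion in a low-dimensional latent space effectively generate temporally coherent, text-aligned videos?
RQ2: Compared with conventional 3D/2+1D video models, do the frame-wise adaptor and directed temporal attention improve quality and temporal consistency?
RQ3: How does unsupervised pre-training on video data affect final video quality when the model is fine-tuned on text-video pairs?
RQ4: What effect does the VideoVAE decoder have on reducing dithering artifacts in generated videos?
RQ5: How well does the method scale to high-resolution upsampling through the SR model?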

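Key Findings

MagicVideo generates high-quality, temporally coherent videos that are aligned with the text prompt, and it outperforms several strong baselines in qualitative comparisons.
The directed self-attention mechanism lowers Fréchet Video Distance (FVD) and improves motion consistency by modeling unidirectional temporal dynamics.
The adaptor module, built on frame-wise 2D convolutions, maintains or improves video quality while substantially reducing computation.
Unsupervised pre-training (with CLIP frame embeddings) markedly improves video quality, reducing FVD by about 60 in cross-dataset ablations.
In zero-shot evaluation, MagicVideo achieves FID/FVD scores that are competitive with or better than baselines on MSR-VTT and UCF-101 (e.g., MSR-VTT: FID 36.5, FVD 998; UCF-101: FID 145, FVD 655).
The VideoVAE decoder, integrated with temporal attention, mitigates frame dithering and produces smoother RGB reconstructions.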
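
The unidirectional temporal modeling mentioned above can be sketched as attention along the frame axis with a causal mask. The sketch assumes "directed" means that each frame attends only to itself and earlier frames; the function name and the pure-Python layout are illustrative and not the authors' implementation.

```python
import math

# Sketch of directed (causal) attention over frames at one spatial
# location. ASSUMPTION: frame t may attend to frames 0..t only.

def directed_temporal_attention(queries, keys, values):
    """queries, keys, values: lists of T vectors, one per frame."""
    dim = len(queries[0])
    outputs = []
    for t in range(len(queries)):
        # Scaled dot-product scores against frames 0..t; later frames
        # are never scored, which is equivalent to masking them out.
        scores = [
            sum(q * k for q, k in zip(queries[t], keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible frames.
        top = max(scores)
        weights = [math.exp(x - top) for x in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible frames' value vectors.
        outputs.append([
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ])
    return outputs

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = directed_temporal_attention(q, q, v)

# Changing the last frame leaves the outputs for earlier frames untouched.
v_changed = [[1.0, 2.0], [3.0, 4.0], [-9.0, 9.0]]
out_changed = directed_temporal_attention(q, q, v_changed)
print(out[:2] == out_changed[:2])
```

Because the first frame sees only itself, its output equals its own value vector, and edits to later frames cannot leak backward in time.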


This review was generated by AI and checked by a human editor.