QUICK REVIEW

[Paper Review] VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet

Zhihao Hu, Dong Xu|arXiv (Cornell University)|Jul 26, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 12

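One-Sentence Summary

VideoControlNet combines a diffusion model with ControlNet and motion information to translate an input video according to diverse text prompts while preserving content consistency: I-frames are generated with ControlNet, P-frames with MgPG, and B-frames with MgBI.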

ABSTRACT

Recently, diffusion models like StableDiffusion have achieved impressive image generation results. However, the generation process of such diffusion models is uncontrollable, which makes it hard to generate videos with continuous and consistent content. In this work, by using the diffusion model with ControlNet, we proposed a new motion-guided video-to-video translation framework called VideoControlNet to generate various videos based on the given prompts and the condition from the input video. Inspired by the video codecs that use motion information for reducing temporal redundancy, our framework uses motion information to prevent the regeneration of the redundant areas for content consistency. Specifically, we generate the first frame (i.e., the I-frame) by using the diffusion model with ControlNet. Then we generate other key frames (i.e., the P-frame) based on the previous I/P-frame by using our newly proposed motion-guided P-frame generation (MgPG) method, in which the P-frames are generated based on the motion information and the occlusion areas are inpainted by using the diffusion model. Finally, the rest frames (i.e., the B-frame) are generated by using our motion-guided B-frame interpolation (MgBI) module. Our experiments demonstrate that our proposed VideoControlNet inherits the generation capability of the pre-trained large diffusion model and extends the image diffusion model to the video diffusion model by using motion information. More results are provided at our project page.

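Motivation and Objectives

Motivate diffusion-based video generation built on a pre-trained image diffusion model, and address the temporal consistency problem of frame-by-frame methods.
Use ControlNet to condition the diffusion model on the content of the input video.
Introduce motion-guided P-frame generation (MgPG) and motion-guided B-frame interpolation (MgBI) to reduce temporal redundancy.
Show that VideoControlNet inherits the capabilities of StableDiffusion while delivering coherent video translation across prompts.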

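Proposed Method

Generate the I-frame with StableDiffusion and ControlNet, conditioned on a condition image derived from the input frame.
Divide the subsequent frames into groups of pictures (GoPs); generate P-frames with MgPG, which uses optical flow for motion compensation and inpaints newly appearing regions.
Compute the inpainting mask by combining residual information with the occlusion map, to guide diffusion-based inpainting of the occluded regions.
Generate B-frames by using motion information to interpolate between the nearest I/P-frames, fusing the warped frames according to their matching scores.
Use FlowFormer for optical flow estimation, and use ControlNet-conditioned diffusion to translate frames according to the given text prompt.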
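
The frame layout and the inpainting mask described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the assumption that a key frame falls on every `gop_size`-th frame, the use of an absolute per-pixel residual, and the `threshold` value are all hypothetical choices made for illustration.

```python
import numpy as np


def frame_types(num_frames, gop_size):
    """Label each frame as 'I', 'P', or 'B'.

    Assumed layout (not confirmed by the source): frame 0 is the I-frame,
    every gop_size-th frame after it is a P-frame, and all frames in
    between are B-frames. Frames after the last key frame are also labeled
    'B' here; a real pipeline would have to handle that tail separately.
    """
    types = []
    for i in range(num_frames):
        if i == 0:
            types.append("I")
        elif i % gop_size == 0:
            types.append("P")
        else:
            types.append("B")
    return types


def inpainting_mask(residual, occlusion, threshold=0.1):
    """Combine a residual map and an occlusion map into one boolean mask.

    A pixel is marked for diffusion-based inpainting when motion
    compensation explains it poorly (large residual) or when it is
    occluded. The threshold is a hypothetical value.
    """
    residual = np.asarray(residual, dtype=float)
    occlusion = np.asarray(occlusion, dtype=bool)
    return (np.abs(residual) > threshold) | occlusion


if __name__ == "__main__":
    print("".join(frame_types(21, 10)))
    residual = [[0.0, 0.5], [0.05, 0.0]]
    occlusion = [[False, False], [False, True]]
    print(inpainting_mask(residual, occlusion))
```

Under this layout only the I-frame needs a full diffusion pass over the whole image; P-frames reuse warped content and inpaint the masked pixels, and B-frames are interpolated.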

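Experimental Results

Research Questions

RQ1: How can the motion information of the input video be used to preserve content consistency in diffusion-based video translation?
RQ2: Can motion-guided strategies generate or interpolate P-frames and B-frames effectively, reducing redundant regeneration and improving temporal consistency?
RQ3: How does VideoControlNet compare with existing diffusion-based video translation methods in quality, consistency, and speed?
RQ4: What is the practical effect on output quality of conditioning the diffusion model on different inputs (e.g., Canny edges or depth)?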

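Key Findings

A user study shows that VideoControlNet is preferred over Text2Video-Zero and CCPL (74.7% of votes).
On the DAVIS dataset, VideoControlNet outperforms Text2Video-Zero on objective metrics: FVD 981.99 vs 1670.39; IS 18.02 vs 13.23; FID 92.17 vs 119.01; CLIPSIM 26.14 vs 25.66; LPIPS 0.50 vs 0.56.
Optical flow error is lower for VideoControlNet (7.91) than for Text2Video-Zero (17.99).
VideoControlNet runs at 0.30 fps on average, versus 0.19 fps for Text2Video-Zero; with a GoP size of 10 and 20 diffusion steps, it takes 3.4 seconds per frame on average.
P-frame generation (MgPG) and B-frame interpolation (MgBI) reduce the need to run the full diffusion process on every frame, which improves speed (MgBI is markedly faster).
The results demonstrate high content consistency driven by motion information, together with the generation capability inherited from the pre-trained StableDiffusion model with ControlNet.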
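
To put the reported numbers in perspective, the following Python sketch converts frames per second into seconds per frame and computes relative differences. The figures are copied from the findings above; the helper names are hypothetical, and the derived values are simple arithmetic rather than results reported by the paper.

```python
def seconds_per_frame(fps):
    """Convert a throughput in frames per second to seconds per frame."""
    return 1.0 / fps


def relative_reduction(baseline, ours):
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


if __name__ == "__main__":
    # Throughput figures as quoted in the findings.
    ours_fps, baseline_fps = 0.30, 0.19
    print(f"VideoControlNet: {seconds_per_frame(ours_fps):.2f} s/frame")
    print(f"Text2Video-Zero: {seconds_per_frame(baseline_fps):.2f} s/frame")
    print(f"Speed-up: {ours_fps / baseline_fps:.2f}x")

    # FVD and optical flow error as quoted in the findings.
    print(f"FVD reduction: {relative_reduction(1670.39, 981.99):.1%}")
    print(f"Flow error reduction: {relative_reduction(17.99, 7.91):.1%}")
```

The seconds-per-frame value derived from 0.30 fps is close to, but not exactly, the quoted 3.4 seconds per frame, which is the kind of gap that rounding of the reported figures would produce.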


This review was generated by AI and checked by a human editor.