QUICK REVIEW

[Paper Review] Structure and Content-Guided Video Synthesis with Diffusion Models

Patrick Esser, Johnathan Chiu|arXiv (Cornell University)|Feb 6, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 17

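One-Sentence Summary

The paper proposes a structure- and content-guided latent video diffusion model that edits videos according to text or image prompts while preserving the structure of the input. It relies on joint image-video training, depth-based structure representations, and a novel guidance method that controls temporal consistency.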

ABSTRACT

Text-guided generative diffusion models unlock powerful image creation and editing tools. While these have been extended to video generation, current approaches that edit the content of existing footage while retaining structure require expensive re-training for every input or rely on error-prone propagation of image edits across frames. In this work, we present a structure and content-guided video diffusion model that edits videos based on visual or textual descriptions of the desired output. Conflicts between user-provided content edits and structure representations occur due to insufficient disentanglement between the two aspects. As a solution, we show that training on monocular depth estimates with varying levels of detail provides control over structure and content fidelity. Our model is trained jointly on images and videos which also exposes explicit control of temporal consistency through a novel guidance method. Our experiments demonstrate a wide variety of successes; fine-grained control over output characteristics, customization based on a few reference images, and a strong user preference towards results by our model.

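Research Motivation and Goals

Develop a controllable video diffusion model that edits content while preserving structure.
Enable text- and image-guided video editing without training separately on each video.
Provide explicit control over temporal consistency, content fidelity, and structure fidelity.
Explore training on depth-based structure representations at varying levels of detail to modulate fidelity.
Demonstrate customization of edits from a few reference images and a user preference for the model's results.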

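Proposed Method

Extend a latent diffusion model to the spatio-temporal domain by adding temporal layers to a pre-trained image model.
Represent structure with monocular depth estimates and content with CLIP-based embeddings.
Train jointly on images and videos, so that a temporal guidance scale provides control over temporal consistency at inference time.
During denoising, condition on structure s through concatenation and on content c through cross-attention.
Use depth maps with varying degrees of blur t_s to control structure fidelity during training and inference.
Apply classifier-free diffusion guidance, combining content and temporal guidance scales, to modulate prompt fidelity and temporal consistency.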
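
The guidance step in the last item can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `combine_guidance`, its argument names, and the exact way the two scales are combined are assumptions that this summary does not spell out. The sketch treats a per-frame (image-mode) unconditional prediction as the baseline, moves toward the video-mode unconditional prediction with the temporal scale, and then toward the content-conditioned prediction with the content scale.

```python
import numpy as np


def combine_guidance(pred_frame_uncond, pred_video_uncond, pred_video_cond,
                     content_scale, temporal_scale):
    """Combine three model predictions with two guidance scales.

    All names here are hypothetical. The three inputs stand for the
    denoiser output when it is run (a) frame by frame without content,
    (b) as a video model without content, and (c) as a video model with
    content conditioning.
    """
    # Direction from per-frame behavior toward temporally aware behavior.
    temporal_term = temporal_scale * (pred_video_uncond - pred_frame_uncond)
    # Direction from unconditional toward content-conditioned behavior.
    content_term = content_scale * (pred_video_cond - pred_video_uncond)
    return pred_frame_uncond + temporal_term + content_term


# Toy latents with layout (frames, channels, height, width).
rng = np.random.default_rng(0)
shape = (8, 4, 16, 16)
frame_uncond = rng.normal(size=shape)
video_uncond = rng.normal(size=shape)
video_cond = rng.normal(size=shape)

guided = combine_guidance(frame_uncond, video_uncond, video_cond,
                          content_scale=7.5, temporal_scale=1.25)
print(guided.shape)
```

With both scales set to 1 the combination reduces to the plain content-conditioned video prediction, and with both set to 0 it falls back to the per-frame prediction. Scales above 1 extrapolate past the conditioned prediction, which is the usual way classifier-free guidance strengthens an effect.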

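Experimental Results

Research Questions

RQ1: How can a diffusion model edit the content of a video while preserving the structure of the input video?
RQ2: Is joint image-video training effective for controlling temporal consistency at inference time?
RQ3: How can depth-based structure representations and CLIP-based content representations be used effectively as conditioning in a video diffusion model?
RQ4: To what extent can edit fidelity and temporal smoothness be controlled through sampling guidance and the level of structure detail?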

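Main Findings

The model provides fine-grained control over temporal consistency, structure fidelity, and content edits at inference time.
Compared with image-only approaches, joint training on image and video data improves temporal consistency.
Depth-based structure representations with varying levels of detail (t_s) control how much of the input structure is preserved in an edit.
Content can be guided by text prompts or example images, using CLIP embeddings and a learned prior that converts text into image embeddings.
A novel temporal guidance mechanism (ω_t) applied during sampling improves frame-to-frame coherence while maintaining prompt consistency.
A user study shows that the approach is preferred over several baseline methods for text- and image-guided video editing.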
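
The role of the structure detail level t_s can be illustrated with a small sketch. It assumes, purely for illustration, that detail is removed by repeated 2x average pooling followed by upsampling back to the original size. The function name `degrade_structure` and this particular degradation operator are hypothetical, and the paper's actual operator may differ. The point is only that a larger t_s leaves a coarser depth map, so the edit is constrained by less of the input structure.

```python
import numpy as np


def degrade_structure(depth, t_s):
    """Remove detail from a 2D depth map (hypothetical helper).

    Applies t_s rounds of 2x average pooling, then upsamples back to the
    input resolution by nearest-neighbor lookup. t_s = 0 returns the
    depth map unchanged (as float64).
    """
    if t_s < 0:
        raise ValueError("t_s must be non-negative")
    height, width = depth.shape
    coarse = depth.astype(np.float64)
    for _ in range(t_s):
        h, w = coarse.shape
        if h < 2 or w < 2:
            break  # nothing left to pool
        h2, w2 = h - h % 2, w - w % 2  # crop to even size before pooling
        coarse = coarse[:h2, :w2].reshape(h2 // 2, 2, w2 // 2, 2).mean(axis=(1, 3))
    # Map every output pixel to the coarse cell that covers it.
    rows = np.arange(height) * coarse.shape[0] // height
    cols = np.arange(width) * coarse.shape[1] // width
    return coarse[np.ix_(rows, cols)]


rng = np.random.default_rng(0)
depth = rng.random((64, 64))  # stand-in for a monocular depth estimate
for t_s in (0, 2, 4):
    print(t_s, float(degrade_structure(depth, t_s).std()))
```

The printed standard deviation shrinks as t_s grows, reflecting that fine depth variation is averaged away while the output keeps the input resolution.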

This review was generated by AI and reviewed by human editors.