QUICK REVIEW

[Paper Review] TokenFlow: Consistent Diffusion Features for Consistent Video Editing

Michal Geyer, Omer Bar-Tal|arXiv (Cornell University)|Jul 19, 2023
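Generative Adversarial Networks and Image Synthesis | Cited by 40
One-Sentence Summary

TokenFlow makes video edits temporally consistent by propagating diffusion features across frames along inter-frame correspondences, which enables high-quality, text-driven video editing without any training. It outperforms per-frame editing baselines in temporal coherence.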

ABSTRACT

The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing. Specifically, given a source video and a target text-prompt, our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video. Our method is based on a key observation that consistency in the edited video can be obtained by enforcing consistency in the diffusion feature space. We achieve this by explicitly propagating diffusion features based on inter-frame correspondences, readily available in the model. Thus, our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method. We demonstrate state-of-the-art editing results on a variety of real-world videos. Webpage: https://diffusion-tokenflow.github.io/

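Motivation and Objectives

  • Improve video editing quality and temporal consistency by using a pre-trained image diffusion model.
  • Enforce inter-frame consistency during editing by working in the diffusion feature space.
  • Provide a training-free framework that is compatible with off-the-shelf image editing methods.
  • Demonstrate state-of-the-art temporal consistency on a variety of real-world videos.
  • Analyze the properties of diffusion features and how they relate to the redundancy in video.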

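Proposed Method

  • Apply DDIM inversion to the video frames and extract diffusion tokens from across the network's layers.
  • Edit a set of keyframes jointly, sharing global appearance through attention that is extended across multiple frames.
  • Propagate the edited tokens to the non-keyframes using nearest-neighbor correspondences computed in the original diffusion feature space.
  • Combine keyframe editing with TokenFlow propagation at every denoising step so that frames stay consistent with one another.
  • Support propagation with any diffusion-based image editing method (PnP, Meng et al., Zhang & Agrawala, etc.).
  • Evaluate temporal consistency with warp error and a user study, and edit fidelity with CLIP similarity.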
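
The two token-level operations in the list above, extended attention over keyframes and nearest-neighbor propagation to the remaining frames, can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the array shapes, and the use of Euclidean distance for the nearest-neighbor search are all choices made here, and a real pipeline would apply these steps inside the diffusion model's self-attention layers at every denoising step.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def extended_attention(q, k, v):
    """Attention in which every frame attends to the tokens of all frames.

    q, k, v: arrays of shape (n_frames, n_tokens, dim) (assumed layout).
    Returns an array of shape (n_frames, n_tokens, dim).
    """
    n_frames, n_tokens, dim = q.shape
    # Concatenate keys and values of all frames into one shared pool.
    k_all = k.reshape(n_frames * n_tokens, dim)
    v_all = v.reshape(n_frames * n_tokens, dim)
    scores = q @ k_all.T / np.sqrt(dim)
    return softmax(scores, axis=-1) @ v_all


def nearest_neighbor_indices(frame_feats, keyframe_feats):
    """For each frame token, index of the closest keyframe token.

    Both inputs are ORIGINAL (unedited) features of shape (n_tokens, dim).
    Euclidean distance is an assumption of this sketch.
    """
    d2 = (
        np.sum(frame_feats ** 2, axis=1, keepdims=True)
        - 2.0 * frame_feats @ keyframe_feats.T
        + np.sum(keyframe_feats ** 2, axis=1)
    )
    return np.argmin(d2, axis=1)


def propagate_tokens(orig_keyframe, orig_frame, edited_keyframe):
    """Copy edited keyframe tokens to a non-keyframe.

    Correspondences come from the original features; the values that are
    copied come from the edited keyframe.
    """
    idx = nearest_neighbor_indices(orig_frame, orig_keyframe)
    return edited_keyframe[idx]
```

The point of the design is that correspondences are found once in the unedited feature space, where the source video's motion is still intact, and then reused to move edited tokens, so the edit inherits the original motion.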
Figure 2: Fine-grained feature correspondences. Features (i.e., output tokens from the self-attention modules) extracted from a source frame are used to reconstruct nearby frames. This is done by: (a) swapping each feature in the target by its nearest feature in the source, in all layers and all

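Experimental Results

Research Questions

  • RQ1: When using a pre-trained image diffusion model, does consistency in the diffusion feature space yield more temporally coherent video edits?
  • RQ2: Does joint keyframe editing combined with feature-space propagation preserve structure and motion better than per-frame editing baselines?
  • RQ3: How do diffusion features reflect the temporal redundancy of natural videos, and can this be exploited for better editing?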

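Key Findings

  • TokenFlow surpasses the baselines in temporal coherence, with lower warp error and higher user preference.
  • With joint attention and randomly sampled keyframes, the method is more temporally consistent than per-frame editing baselines.
  • The method achieves the highest CLIP score among the compared methods, indicating good alignment with the target prompt.
  • Qualitative results show that the edits preserve the original motion and semantic layout across diverse videos.
  • Ablations show that TokenFlow still improves over extended attention alone, and that random keyframes improve robustness.
  • The tabulated quantitative results show clear gains in warp error, CLIP similarity, and user preference.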
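
To make the warp-error metric concrete, here is a minimal sketch of one common way to compute it: warp one edited frame onto the next using optical flow and measure the mean squared difference. The summary does not give the paper's exact formulation, so the flow convention, nearest-neighbor sampling, and the names `backward_warp` and `warp_error` are assumptions of this sketch. The flow field is taken as given; in practice it would come from an optical flow estimator run on the source video.

```python
import numpy as np


def backward_warp(frame, flow):
    """Warp `frame` (H, W, C) with a backward flow field (H, W, 2).

    Assumed convention: flow[y, x] = (dx, dy) means output pixel (x, y)
    is sampled from frame[y + dy, x + dx], rounded to the nearest pixel.
    Also returns a mask of pixels whose source lies inside the frame.
    """
    h, w = frame.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.rint(xs + flow[..., 0]).astype(int)
    src_y = np.rint(ys + flow[..., 1]).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    # Clamp so indexing is safe; clamped pixels are excluded via `valid`.
    src_x = np.clip(src_x, 0, w - 1)
    src_y = np.clip(src_y, 0, h - 1)
    return frame[src_y, src_x], valid


def warp_error(frame_a, frame_b, flow_b_to_a):
    """Mean squared error between frame_b and frame_a warped onto it."""
    warped, valid = backward_warp(frame_a, flow_b_to_a)
    diff = (warped.astype(float) - frame_b.astype(float)) ** 2
    return float(diff[valid].mean())
```

A lower value means the edited frames move the way the flow says they should, which is why the metric is read as a proxy for temporal consistency. Averaging the per-pair error over all consecutive frame pairs would give a single score per video.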
Figure 3: Diffusion features across time. Left: Given an input video (top row), we apply DDIM inversion on each frame and extract features from the highest resolution decoder layer in $\epsilon_{\theta}$. We apply PCA on the features (i.e., output tokens from the self-attention module) extracted fr
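
The PCA visualization mentioned in the caption can be sketched as below. This is an illustrative reconstruction, not the authors' code: it assumes the per-frame features have already been extracted and stacked into a single `(n_tokens_total, dim)` array, and the helper names `pca_project` and `to_rgb` are hypothetical.

```python
import numpy as np


def pca_project(features, n_components=3):
    """Project features (n_tokens, dim) onto their top principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def to_rgb(projected):
    """Min-max scale each component to [0, 1] so it can be shown as a color."""
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    return (projected - lo) / np.maximum(hi - lo, 1e-12)
```

Fitting the PCA jointly on tokens from all frames, then reshaping the result back to `(n_frames, height, width, 3)`, gives every frame the same color mapping, so regions with similar features show up in similar colors across time.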


This review was generated by AI and checked by human editors.