QUICK REVIEW

[Paper Review] Towards Consistent Video Editing with Text-to-Image Diffusion Models

Zicheng Zhang, Bonan Li | arXiv (Cornell University) | May 27, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 7

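One-Sentence Summary

EI$^2$ strengthens text-driven video editing by addressing the covariate shift introduced by temporal modules, using STAM and FFAM to improve temporal consistency and semantic alignment.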

ABSTRACT

Existing works have advanced Text-to-Image (TTI) diffusion models for video editing in a one-shot learning manner. Despite their low requirements of data and computation, these methods might produce results of unsatisfied consistency with text prompt as well as temporal sequence, limiting their applications in the real world. In this paper, we propose to address the above issues with a novel EI$^2$ model towards \textbf{E}nhancing v\textbf{I}deo \textbf{E}diting cons\textbf{I}stency of TTI-based frameworks. Specifically, we analyze and find that the inconsistent problem is caused by newly added modules into TTI models for learning temporal information. These modules lead to covariate shift in the feature space, which harms the editing capability. Thus, we design EI$^2$ to tackle the above drawbacks with two classical modules: Shift-restricted Temporal Attention Module (STAM) and Fine-coarse Frame Attention Module (FFAM). First, through theoretical analysis, we demonstrate that covariate shift is highly related to Layer Normalization, thus STAM employs a \textit{Instance Centering} layer replacing it to preserve the distribution of temporal features. In addition, STAM employs an attention layer with normalized mapping to transform temporal features while constraining the variance shift. As the second part, we incorporate STAM with a novel FFAM, which efficiently leverages fine-coarse spatial information of overall frames to further enhance temporal consistency. Extensive experiments demonstrate the superiority of the proposed EI$^2$ model for text-driven video editing.

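Research Motivation and Objectives

Improve temporal and semantic consistency when one-shot methods adapt TTI models to text-to-video (TTV) editing.
Investigate why temporal modules cause covariate shift and thereby degrade editing capability.
Develop theoretically grounded modules that preserve text-driven editing capability while ensuring temporal coherence.
Provide practical one-shot tuning guidance for extending pre-trained TTI models to video editing tasks.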

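Proposed Method

Proposes EI$^2$, which comprises two modules: STAM (Shift-restricted Temporal Attention Module) and FFAM (Fine-coarse Frame Attention Module).
Theoretically analyzes the covariate shift caused by the temporal attention (TA) modules and by Layer Normalization in diffusion-based Transformers.
Replaces Layer Normalization with Instance Centering to restrict the mean shift, and applies spectral normalization to control the variance shift.
FFAM lets the fine-grained information of the current frame interact with the coarse information of the other frames, yielding global spatio-temporal consistency.
Extends a pre-trained LDM into a TTV model by converting its spatial convolutions to pseudo-3D convolutions, then fine-tunes it with the diffusion loss in the one-shot setting.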
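To make the normalization argument concrete, here is a minimal NumPy sketch contrasting Layer Normalization with a centering-only layer, plus a spectrally normalized weight matrix. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, Instance Centering is assumed to subtract the mean over the channel axis without rescaling, and spectral normalization is assumed to divide a weight matrix by its largest singular value.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Standard LayerNorm without learnable affine parameters:
    center and rescale each token over its last (channel) axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def instance_centering(x):
    """Assumed form of Instance Centering: subtract the per-token mean
    over the channel axis and leave the scale of the features untouched."""
    return x - x.mean(axis=-1, keepdims=True)

def spectral_normalize(w):
    """Divide a 2-D weight matrix by its largest singular value so the
    linear map cannot stretch any input vector (spectral norm of 1)."""
    sigma = np.linalg.norm(w, ord=2)  # largest singular value for 2-D input
    return w / sigma

rng = np.random.default_rng(0)

# Toy temporal features: 4 tokens, 16 channels, deliberately low variance.
x = rng.normal(loc=3.0, scale=0.2, size=(4, 16))

ln_out = layer_norm(x)          # per-token std is forced to about 1
ic_out = instance_centering(x)  # per-token std stays equal to that of x

# A random projection matrix and its spectrally normalized version.
w = rng.normal(size=(16, 16))
w_sn = spectral_normalize(w)
```

In this toy setting `layer_norm` rescales every token to unit variance regardless of the input statistics, whereas `instance_centering` only removes the mean, and `w_sn` bounds how much a subsequent linear map can enlarge the features.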

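Experimental Results

Research Questions

RQ1: Do newly added temporal modules cause covariate shift in TTI-based video editing and thereby reduce editing capability?
RQ2: Compared with prior methods, do STAM and FFAM effectively mitigate semantic discrepancy and improve temporal consistency?
RQ3: How can editing capability be retained in a one-shot fine-tuned video diffusion model while maintaining global temporal coherence?
RQ4: What are the theoretical and practical effects of replacing Layer Normalization with Instance Centering and applying spectral normalization?
RQ5: What is the trade-off between temporal coherence and computational efficiency when using FFAM versus SCA?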

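Key Findings

In qualitative comparisons, EI$^2$ achieves better semantic alignment and temporal consistency than Tune-A-Video, Vid2Vid-zero, and Video-P2P.
Quantitative results show that EI$^2$ receives the most user votes and achieves competitive CLIP-based frame alignment while keeping training and inference costs reasonable.
Ablations show that replacing Layer Normalization with Instance Centering and normalizing the mapping weights significantly reduces covariate shift and improves text guidance.
FFAM provides better temporal coherence than SCA by letting fine-grained information from the current frame interact with downsampled coarse information from the other frames.
The proposed STAM effectively restricts distribution shift and improves editing fidelity without sacrificing temporal dynamics.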
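The fine-coarse interaction attributed to FFAM can be sketched roughly as follows. This is a simplified illustration, not the paper's module: it assumes average pooling for the downsampling, a single attention head, and no learned query/key/value projections, and the names `avg_pool_tokens` and `fine_coarse_attention` are hypothetical.

```python
import numpy as np

def avg_pool_tokens(frame, factor):
    """Downsample an (H, W, C) feature map by average pooling over
    non-overlapping factor x factor windows; return (H*W/factor^2, C) tokens."""
    h, w, c = frame.shape
    assert h % factor == 0 and w % factor == 0
    pooled = frame.reshape(h // factor, factor, w // factor, factor, c)
    return pooled.mean(axis=(1, 3)).reshape(-1, c)

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fine_coarse_attention(frames, t, factor=2):
    """frames: (T, H, W, C) features. Queries come from frame t at full
    resolution; keys/values are the fine tokens of frame t plus the
    pooled (coarse) tokens of every other frame."""
    num_frames, _, _, c = frames.shape
    q = frames[t].reshape(-1, c)                       # fine queries
    coarse = [avg_pool_tokens(frames[i], factor)
              for i in range(num_frames) if i != t]    # coarse context
    kv = np.concatenate([q] + coarse, axis=0)
    attn = softmax(q @ kv.T / np.sqrt(c))              # (H*W, num_keys)
    return attn @ kv, attn

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8, 8, 16))  # 4 frames of 8x8 tokens, 16 channels
out, attn = fine_coarse_attention(frames, t=1, factor=2)
```

With these toy sizes, each query in frame 1 attends to 64 fine tokens plus 3 x 16 coarse tokens, 112 keys in total, instead of the 256 keys that full attention over all four frames would require. That gap is the kind of efficiency and coherence trade-off RQ5 refers to.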

This review was generated by AI and reviewed by human editors.