QUICK REVIEW

[Paper Review] Edit-A-Video: Single Video Editing with Object-Aware Consistency

Chaehun Shin, Heeseung Kim | arXiv (Cornell University) | Mar 14, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 12

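ONE-SENTENCE SUMMARY

Edit-A-Video edits a single video under text-prompt guidance by inflating a 2D diffusion model to 3D, inverting the source video, and injecting attention maps; a novel temporal-consistent blending step keeps the background consistent.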

ABSTRACT

Despite the fact that text-to-video (TTV) model has recently achieved remarkable success, there have been few approaches on TTV for its extension to video editing. Motivated by approaches on TTV models adapting from diffusion-based text-to-image (TTI) models, we suggest the video editing framework given only a pretrained TTI model and a single <text, video> pair, which we term Edit-A-Video. The framework consists of two stages: (1) inflating the 2D model into the 3D model by appending temporal modules and tuning on the source video, and (2) inverting the source video into the noise and editing with target text prompt and attention map injection. Each stage enables the temporal modeling and preservation of semantic attributes of the source video. One of the key challenges for video editing includes a background inconsistency problem, where the regions not included for the edit suffer from undesirable and inconsistent temporal alterations. To mitigate this issue, we also introduce a novel mask blending method, termed as sparse-causal blending (SC Blending). We improve previous mask blending methods to reflect the temporal consistency so that the area where the editing is applied exhibits smooth transition while also achieving spatio-temporal consistency of the unedited regions. We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.

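MOTIVATION AND OBJECTIVES

Enable text-guided video editing using only a pretrained text-to-image model and a single <text, video> pair.
Develop a two-stage framework that inflates the 2D model to 3D for temporal modeling and performs the edit through inversion and attention map injection.
Mitigate the background inconsistency problem with a novel temporal-consistent blending method (TC Blending; the abstract above calls it sparse-causal blending, SC Blending) that preserves unedited regions over time.
Analyze the role of different attention modules in achieving temporal consistency and content preservation.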

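PROPOSED METHOD

Inflate the pretrained 2D TTI model into a 3D TTV model by adding temporal modules and converting the 2D convolutions and self-attention into their temporal counterparts.
Invert the source video into Gaussian noise via DDIM inversion and optimize the null-text embedding so that the source video can be reconstructed during editing.
Edit by injecting the source attention maps into the generation process conditioned on the target text, so that the edited content aligns with the spatial layout of the source.
Introduce Temporal-Consistent Blending (TC Blending), which generates frame-consistent blending masks that localize the edited region while keeping the background consistent over time.
Compute sparse spatio-temporal attention (ST-Attn), which relates the features of the current frame to those of the first and previous frames, for mask construction.
Analyze the roles of Cross-Attention, Temporal Attention, and ST-Attn in maintaining temporal consistency and editing fidelity.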
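
The mask-construction idea in the last few items can be illustrated with a small NumPy sketch. It is only one plausible reading of the summary, not the authors' formulation: the tensor layout, the function names (`sparse_frame_attention`, `consistent_masks`, `blend`), the threshold of 0.5, and the choice to propagate the raw cross-attention maps of the first and previous frames are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sparse_frame_attention(feats, i):
    """Attention weights from frame i to the first and previous frames.

    feats: (F, N, D) per-frame token features (hypothetical layout).
    Returns an (N, 2N) array whose rows sum to 1.
    """
    prev = max(i - 1, 0)  # frame 0 has no predecessor, so it attends to itself
    q = feats[i]
    k = np.concatenate([feats[0], feats[prev]], axis=0)  # (2N, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1)


def consistent_masks(feats, cross_attn, threshold=0.5):
    """Per-frame boolean masks built from attention-propagated maps.

    cross_attn: (F, N) cross-attention map of the edited word, values in [0, 1].
    Each frame's mask is obtained by carrying the maps of the first and
    previous frames over to the current frame with the sparse attention
    weights, then thresholding (an assumed, simplified rule).
    """
    num_frames = cross_attn.shape[0]
    propagated = np.empty_like(cross_attn, dtype=np.float64)
    for i in range(num_frames):
        prev = max(i - 1, 0)
        weights = sparse_frame_attention(feats, i)                      # (N, 2N)
        reference = np.concatenate([cross_attn[0], cross_attn[prev]])   # (2N,)
        propagated[i] = weights @ reference
    return propagated > threshold


def blend(edited, source, masks):
    """Keep edited latents inside the mask and source latents outside it.

    edited, source: (F, N, C) arrays; masks: (F, N) boolean array.
    """
    return np.where(masks[..., None], edited, source)
```

Because every mask depends on the same two reference frames, neighbouring frames tend to agree on which tokens count as background, which is the property the blending step is meant to provide.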

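EXPERIMENTAL RESULTS

Research Questions

RQ1: Does inflating a text-to-image diffusion model into a video model and tuning it on a single video produce edits that follow the target text and remain temporally consistent?
RQ2: Does attention map injection enable faithful editing of the target object while preserving unedited regions across frames?
RQ3: Does TC Blending produce masks that are sharp in each frame and temporally consistent, reducing background inconsistency in the edited video?
RQ4: How much do the different attention modules (Cross-Attention, Temporal Attention, ST-Attn) contribute to editing quality and temporal consistency?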

Key Findings

Method	User score (O)	Text alignment	LPIPS (lower is better)	PSNR (dB, higher is better)
Edit-A-Video (Ours)	3.80±0.10	30.2688	0.2625	20.0992
Tune-A-Video	3.46±0.10	30.0514	0.4482	14.5753
SDEdit	3.40±0.10	28.4203	0.2711	20.4767
Video-P2P	3.66±0.10	30.0842	0.3047	17.5760
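
For readers less familiar with the PSNR column, the following is a minimal sketch of the standard definition, assuming pixel values scaled to [0, 1]. The function name and the `max_val` parameter are illustrative, and the summary does not state over which frames or regions the paper computes the metric.

```python
import numpy as np


def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs
    return 10.0 * np.log10(max_val ** 2 / mse)
```

As a point of reference, a uniform per-pixel error of 0.1 on a [0, 1] scale gives a mean squared error of 0.01 and therefore a PSNR of 20 dB, which is the range reported in the table for Edit-A-Video and SDEdit.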

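Edit-A-Video receives higher user preference scores than the baselines for background preservation, text alignment, and video realism.
Quantitatively, Edit-A-Video reaches a User Score (O) of 3.80±0.10, Text Alignment of 30.2688, LPIPS of 0.2625, and PSNR of 20.0992. It leads Tune-A-Video, SDEdit, and Video-P2P on user score, text alignment, and LPIPS; only SDEdit has a slightly higher PSNR (20.4767).
TC Blending improves the target-object mask and background preservation: it obtains a higher user score and better LPIPS, PSNR, and mask IoU than the ablated variants.
The ablation study shows that TC Blending produces sharper, temporally consistent masks and reduces background inconsistency.
An injection duration of 0.2 for Cross-Attention achieves the target semantics while preserving the spatial layout; Temporal Attention (0.8) shows robust temporal modeling; ST-Attn (0.5) balances dynamic motion against editing focus.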
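
The summary does not define "injection duration". A common convention in attention-injection editing is to treat it as the fraction of denoising steps, counted from the noisiest step, during which source attention maps replace the target ones. Under that assumption (and with a hypothetical helper name), the three values translate into step counts as follows:

```python
def injection_steps(num_steps, duration):
    """Indices of denoising steps (0 = first, noisiest step) that use injection.

    Assumes `duration` is a fraction of the total number of steps, counted
    from the start of sampling. This is an assumption, not a definition
    taken from the paper.
    """
    if not 0.0 <= duration <= 1.0:
        raise ValueError("duration must lie in [0, 1]")
    cutoff = int(round(num_steps * duration))
    return list(range(cutoff))


# With a 50-step sampler: Cross-Attention 0.2, Temporal Attention 0.8, ST-Attn 0.5
for name, duration in [("cross", 0.2), ("temporal", 0.8), ("st", 0.5)]:
    print(name, len(injection_steps(50, duration)))
```

Read this way, the cross-attention maps would be injected only during the early, layout-defining steps, while the temporal attention maps would be injected for most of the sampling process.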


This review was generated by AI and reviewed by human editors.