QUICK REVIEW

[Paper Review] GenVideo: One-shot Target-image and Shape Aware Video Editing using T2I Diffusion Models

Sai Sree Harsha, Ambareesh Revanur|arXiv (Cornell University)|Apr 18, 2024

Generative Adversarial Networks and Image Synthesis | Cited by 6

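One-Sentence Summary

GenVideo edits videos using a target image and shape-aware InvEdit masks, combined with latent correction, so that edits stay temporally consistent even when the target shape differs from the source shape.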

ABSTRACT

Video editing methods based on diffusion models that rely solely on a text prompt for the edit are hindered by the limited expressive power of text prompts. Thus, incorporating a reference target image as a visual guide becomes desirable for precise control over edit. Also, most existing methods struggle to accurately edit a video when the shape and size of the object in the target image differ from the source object. To address these challenges, we propose "GenVideo" for editing videos leveraging target-image aware T2I models. Our approach handles edits with target objects of varying shapes and sizes while maintaining the temporal consistency of the edit using our novel target and shape aware InvEdit masks. Further, we propose a novel target-image aware latent noise correction strategy during inference to improve the temporal consistency of the edits. Experimental analyses indicate that GenVideo can effectively handle edits with objects of varying shapes, where existing approaches fail.

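Motivation and Goals

Enable precise video editing by using a target image as a visual guide when text alone is not expressive enough.
Support edits in which the target object's shape and size differ from the source object's.
Maintain temporal consistency across frames during editing.
Provide a mask-guided inference framework that can be adapted to image-conditioned diffusion models.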

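Proposed Method

Fine-tune an inflated SD-unCLIP model on the source video so that it accepts both target-image and text conditioning.
Generate target-image and shape-aware InvEdit masks by contrasting source denoising with target denoising across DDIM steps.
Use a latent blending scheme during UNet inference to inject the target image into the masked region.
Apply a latent noise correction strategy at inference time to improve inter-frame temporal consistency.
Guide latent blending with the InvEdit mask to preserve the background or modify it selectively.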
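The mask-guided latent blending step can be illustrated with a minimal NumPy sketch. It assumes the blend is a simple per-pixel convex combination of edited and source latents; the function name `blend_latents`, the tensor shapes, and the toy mask are hypothetical and are not taken from the paper.

```python
import numpy as np

def blend_latents(edited, source, mask):
    """Keep edited latents inside the mask and source latents outside it.

    edited, source: arrays of shape (frames, channels, height, width)
    mask:           array of shape (frames, 1, height, width), values in {0, 1}
    NOTE: a plain convex combination is an assumption, not the paper's exact rule.
    """
    mask = mask.astype(edited.dtype)
    # The singleton channel axis of the mask broadcasts over all latent channels.
    return mask * edited + (1.0 - mask) * source

rng = np.random.default_rng(0)
source = rng.normal(size=(2, 4, 8, 8))   # stand-in for source-video latents
edited = rng.normal(size=(2, 4, 8, 8))   # stand-in for target-conditioned latents
mask = np.zeros((2, 1, 8, 8))
mask[:, :, 2:6, 2:6] = 1.0               # toy square edit region
blended = blend_latents(edited, source, mask)
```

Inside the square region the result equals the edited latents, and everywhere else it equals the source latents, which is how a mask can preserve the background.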

Figure 2: Overview of GenVideo. Inflated attention layers are finetuned during source video finetuning. During inference, InvEdit predicts a region to edit and latent correction uses that mask to improve the inter-frame temporal consistency. $\mathcal{M}_{\phi}$ denotes "no mask".

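Experimental Results

Research Questions

RQ1: When the target object differs from the source object in shape or size, can target-image guidance produce accurate edits?
RQ2: Does InvEdit provide shape-aware mask localization for video editing?
RQ3: Does the latent correction strategy improve inter-frame temporal consistency in shape-changing edits?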

Key Findings

CLIP-T	DINO	Temp	Text	Image	Visual
0.238	0.236	0.957	3.6	3.3	4.2
0.234	0.189	0.980	4.3	4.3	3.7
0.231	0.216	0.985	3.3	3.8	2.1
0.235	0.262	0.951	3.9	3.6	3.4
0.234	0.195	0.949	4.0	4.1	5.0
0.241	0.374	0.967	1.7	1.8	2.3

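In the user study, GenVideo outperforms state-of-the-art baselines on alignment with the target text and the target image.
InvEdit masks give precise, shape-aware localization of the edit and preserve the background where appropriate.
Latent correction improves inter-frame temporal consistency by blending latents using cross-frame feature correspondences.
GenVideo also shows zero-shot image editing capability for shape-changing targets, such as car to bus, while keeping the result coherent.
Quantitative metrics show that GenVideo achieves higher CLIP-T and DINO scores for text and image alignment, and lower user-ranking scores than the baselines.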
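The idea of blending latents through cross-frame feature correspondences can be sketched as follows. This is only an illustrative reading of the one-line description above: nearest-neighbour matching by cosine similarity against the previous frame, the blend weight `alpha`, and the function name `correct_latents` are all assumptions and should not be taken as the paper's actual procedure.

```python
import numpy as np

def correct_latents(latents, features, mask, alpha=0.5):
    """Blend each masked latent with its best-matching latent from the previous frame.

    latents:  (frames, channels, height, width)
    features: (frames, feat_dim, height, width), used only for matching
    mask:     (frames, height, width), 1 where the edit applies
    alpha:    hypothetical blend weight for the matched previous-frame latent
    """
    frames, channels, height, width = latents.shape
    out = latents.copy()
    for t in range(1, frames):
        # Flatten spatial positions and L2-normalise features for cosine similarity.
        prev = features[t - 1].reshape(features.shape[1], -1)
        curr = features[t].reshape(features.shape[1], -1)
        prev = prev / (np.linalg.norm(prev, axis=0, keepdims=True) + 1e-8)
        curr = curr / (np.linalg.norm(curr, axis=0, keepdims=True) + 1e-8)
        # For every position in frame t, index of the most similar position in frame t-1.
        match = (curr.T @ prev).argmax(axis=1)
        prev_lat = out[t - 1].reshape(channels, -1)[:, match]
        prev_lat = prev_lat.reshape(channels, height, width)
        m = mask[t][None].astype(latents.dtype)
        # Only masked positions are pulled toward their previous-frame match.
        out[t] = (1.0 - m * alpha) * out[t] + m * alpha * prev_lat
    return out

# Toy input: one-hot features make every position match itself in the previous frame.
feats = np.tile(np.eye(9).reshape(1, 9, 3, 3), (2, 1, 1, 1))
lats = np.stack([np.zeros((2, 3, 3)), np.ones((2, 3, 3))])
corrected = correct_latents(lats, feats, np.ones((2, 3, 3)), alpha=0.5)
```

In this toy case the second frame's latents move halfway toward the first frame's, while positions outside the mask would be left untouched.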

Figure 3: InvEdit approach. The mask is generated by first iteratively computing noise differences across multiple timesteps for the source denoising branch and target denoising branch. Then, these differences are averaged and binarized to obtain the InvEdit mask.
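The averaging-and-binarizing step in the caption can be sketched in NumPy. The caption does not say how the differences are reduced over channels, how they are normalised, or which threshold is used, so the absolute-difference reduction, the min-max normalisation, and `threshold=0.5` below are assumptions, as is the function name `invedit_style_mask`.

```python
import numpy as np

def invedit_style_mask(source_noise, target_noise, threshold=0.5):
    """Average per-timestep noise differences and binarize them.

    source_noise, target_noise: (timesteps, channels, height, width) noise
    predictions from the source and target denoising branches.
    NOTE: reduction, normalisation and threshold are illustrative assumptions.
    """
    # Magnitude of the branch disagreement at each timestep, averaged over channels.
    diff = np.abs(target_noise - source_noise).mean(axis=1)   # (T, H, W)
    avg = diff.mean(axis=0)                                   # (H, W)
    span = avg.max() - avg.min()
    norm = (avg - avg.min()) / span if span > 0 else np.zeros_like(avg)
    return (norm > threshold).astype(np.uint8)

rng = np.random.default_rng(0)
src = rng.normal(size=(5, 4, 8, 8))
tgt = src.copy()
tgt[:, :, 2:6, 2:6] += 3.0        # the branches disagree only in this square
inv_mask = invedit_style_mask(src, tgt)
```

Here the mask is 1 exactly in the square where the two branches predict different noise, which mirrors the intuition that the edit region is where source and target denoising diverge.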


This review was generated by AI and checked by human editors.