QUICK REVIEW

[Paper Review] Text2LIVE: Text-Driven Layered Image and Video Editing

Omer Bar-Tal, Dolev Ofri-Amar|arXiv (Cornell University)|Apr 5, 2022

Generative Adversarial Networks and Image Synthesis | Cited by 27

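One-Sentence Summary

Text2LIVE performs zero-shot, text-guided, localized editing by generating an edit layer (RGBA) and compositing it over the input image or video. It trains on an internal image-text dataset built from the single input, uses neither masks nor a pre-trained generator, and extends to temporally consistent video editing via neural layered atlases.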

ABSTRACT

We present a method for zero-shot, text-driven appearance manipulation in natural images and videos. Given an input image or video and a target text prompt, our goal is to edit the appearance of existing objects (e.g., object's texture) or augment the scene with visual effects (e.g., smoke, fire) in a semantically meaningful manner. We train a generator using an internal dataset of training examples, extracted from a single input (image or video and target text prompt), while leveraging an external pre-trained CLIP model to establish our losses. Rather than directly generating the edited output, our key idea is to generate an edit layer (color+opacity) that is composited over the original input. This allows us to constrain the generation process and maintain high fidelity to the original input via novel text-driven losses that are applied directly to the edit layer. Our method neither relies on a pre-trained generator nor requires user-provided edit masks. We demonstrate localized, semantic edits on high-resolution natural images and videos across a variety of objects and scenes.

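Motivation and Goals

Enable semantic, localized appearance edits of real-world images and videos from simple text prompts.
Develop a layered editing framework that generates an RGBA edit layer composited over the input while preserving fidelity to the original content.
Use internal learning on a single input together with CLIP-based losses, avoiding reliance on a pre-trained generator.
Extend the method to video with neural layered atlases to ensure temporal consistency.
Demonstrate diverse edits (textures and semi-transparent effects) across objects and scenes.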

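Proposed Method

Introduce a generator Gθ that outputs an edit layer E = {C, α} (color C and opacity α), which is composited over the source image I_s to produce I_o = α·C + (1−α)·I_s.
Guide the edit with three CLIP-based losses: L_comp (the final composite matches the target text T), L_screen (the edit layer, composited over a green background, matches the screen prompt T_screen, providing green-screen supervision), and L_structure (content structure is preserved via self-similarity of CLIP features).
Apply a sparsity regularizer L_reg that encourages a sparse α, limiting the extent of the edit.
Bootstrap α for localization using a relevancy map R(I_s) computed from a text ROI prompt T_ROI; this supervision is gradually annealed during training.
Train Gθ from scratch on an internal dataset, built by applying data augmentations to the single input image (I_s) and target text (T) to create diverse training pairs.
Extend to video with Neural Layered Atlases (NLA): the generator is trained to edit an atlas-level layer E_A, which is mapped to the frames through a fixed UV mapping M, ensuring temporal consistency.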
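
The compositing equation and the sparsity term above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`composite`, `green_screen`, `sparsity_penalty`) are hypothetical, the arrays are toy data, and the mean-absolute-value penalty is only an assumed stand-in for L_reg, whose exact form may differ in the paper. The CLIP-based losses are not modeled here.

```python
import numpy as np

def composite(color, alpha, background):
    """Alpha-composite an edit layer: I_o = alpha * C + (1 - alpha) * background."""
    alpha = alpha[..., None]  # (H, W) -> (H, W, 1) so it broadcasts over RGB
    return alpha * color + (1.0 - alpha) * background

def green_screen(shape_hw):
    """Solid green background, the kind of canvas a screen loss would look at."""
    h, w = shape_hw
    bg = np.zeros((h, w, 3), dtype=np.float64)
    bg[..., 1] = 1.0
    return bg

def sparsity_penalty(alpha):
    """Assumed L1-style penalty: mean absolute opacity (smaller = sparser edit)."""
    return float(np.mean(np.abs(alpha)))

# Toy data: a grey 4x4 source image and a red edit confined to the centre 2x2.
source = np.full((4, 4, 3), 0.5)
color = np.zeros((4, 4, 3))
color[..., 0] = 1.0
alpha = np.zeros((4, 4))
alpha[1:3, 1:3] = 1.0

output = composite(color, alpha, source)                   # edit over the source
screen = composite(color, alpha, green_screen(alpha.shape))  # edit over green
reg = sparsity_penalty(alpha)
```

Where α is zero the source pixel passes through unchanged, which is why penalizing α keeps the edit local; compositing the same layer over green isolates the edit content from the original image.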

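Experimental Results

Research Questions

RQ1: Can text-driven localized edits be generated for real-world images without masks or a pre-trained generator?
RQ2: Does generating an RGBA edit layer give better control over, and fidelity of, CLIP-guided edits than generating the image directly?
RQ3: Can the method be extended to temporally consistent video editing through a layered atlas representation?
RQ4: How effective are internal (single-input) training and text-based losses at constraining edits to the desired regions and semantics?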

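Key Findings

The method achieves semantic, localized edits across a wide range of objects and scenes, including texture changes and semi-transparent effects.
The RGBA edit layer, supervised by dedicated CLIP-based losses, gives precise control over localization and content, improving fidelity to the target prompt.
Internal learning on a single input with augmented text-image pairs produces high-quality edits without an external generator or masks.
The video extension based on Neural Layered Atlases yields temporally consistent edits by mapping the atlas edit back to the frames.
A subjective evaluation on Amazon Mechanical Turk (AMT) shows competitive or better performance than mask-free baselines on both image and video tasks.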
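
The atlas-to-frame mapping behind the video finding can be sketched as follows, under simplifying assumptions: the UV maps are given as dense per-pixel arrays with coordinates in [0, 1], sampling is nearest-neighbour, and the helper names (`sample_atlas`, `edit_frame`) are hypothetical. Neural Layered Atlases represent the mapping with learned networks, so this is an illustration of the idea only, not the method's code.

```python
import numpy as np

def sample_atlas(atlas_rgba, uv):
    """Nearest-neighbour lookup of atlas pixels at per-pixel (u, v) in [0, 1]."""
    ah, aw = atlas_rgba.shape[:2]
    cols = np.clip(np.rint(uv[..., 0] * (aw - 1)).astype(int), 0, aw - 1)
    rows = np.clip(np.rint(uv[..., 1] * (ah - 1)).astype(int), 0, ah - 1)
    return atlas_rgba[rows, cols]  # shape (H, W, 4)

def edit_frame(frame, atlas_rgba, uv):
    """Map the atlas edit layer into one frame and alpha-composite it."""
    layer = sample_atlas(atlas_rgba, uv)
    color, alpha = layer[..., :3], layer[..., 3:4]
    return alpha * color + (1.0 - alpha) * frame

# Toy atlas edit layer: one opaque red texel at (row 0, col 0), rest transparent.
atlas = np.zeros((2, 2, 4))
atlas[0, 0] = [1.0, 0.0, 0.0, 1.0]

# Two grey 2x2 frames; the same atlas point appears at a different pixel in each.
frames = np.full((2, 2, 2, 3), 0.2)
uv0 = np.ones((2, 2, 2))
uv0[0, 0] = [0.0, 0.0]  # frame 0: top-left pixel sees the edited texel
uv1 = np.ones((2, 2, 2))
uv1[1, 1] = [0.0, 0.0]  # frame 1: bottom-right pixel sees the edited texel

edited = [edit_frame(f, atlas, uv) for f, uv in zip(frames, [uv0, uv1])]
```

Because every frame reads from the same atlas edit, a surface point receives the same color and opacity wherever it moves, which is the source of the temporal consistency.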

This review was generated by AI and reviewed by a human editor.