QUICK REVIEW

[Paper Review] InstructPix2Pix: Learning to Follow Image Editing Instructions

Tim Brooks, Aleksander Holynski|arXiv (Cornell University)|Nov 17, 2022

Multimodal Machine Learning Applications | Cited by 40

One-Sentence Summary

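A diffusion-based model learns to edit images from human-written instructions by training on a large synthetic paired dataset generated with GPT-3 and Stable Diffusion, enabling zero-shot editing of real images without per-example fine-tuning.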

ABSTRACT

We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image. To obtain training data for this problem, we combine the knowledge of two large pretrained models -- a language model (GPT-3) and a text-to-image model (Stable Diffusion) -- to generate a large dataset of image editing examples. Our conditional diffusion model, InstructPix2Pix, is trained on our generated data, and generalizes to real images and user-written instructions at inference time. Since it performs edits in the forward pass and does not require per example fine-tuning or inversion, our model edits images quickly, in a matter of seconds. We show compelling editing results for a diverse collection of input images and written instructions.

Motivation and Objectives

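Motivate a system that edits images from natural-language instructions rather than full image descriptions.
Address the data bottleneck by generating large-scale multimodal training data from pretrained models.
Develop a diffusion-based editor that applies a wide range of edits in the forward pass, without per-example fine-tuning.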

Proposed Method

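Create a large paired dataset of input captions, edit instructions, and output captions by fine-tuning GPT-3 on a small human-written set and applying it to LAION captions.
Convert each caption pair into an image pair using Stable Diffusion with Prompt-to-Prompt, which encourages visual consistency between the before and after images.
Train a latent diffusion model (InstructPix2Pix) conditioned on the input image and the edit instruction to perform the edit in the forward pass.
Apply classifier-free guidance over two conditionings (input image cI and instruction cT), tuning the guidance weights sI and sT to trade off fidelity to the input against adherence to the instruction.
Initialize the editor from a pretrained Stable Diffusion checkpoint and add input channels to the first convolutional layer to accept the encoded input image.
Filter the generated image pairs with CLIP-based directional similarity to improve data quality.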
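The two-conditioning guidance step can be illustrated with a small sketch. It assumes the denoiser has already been evaluated three times for the same noisy latent: with both conditionings dropped, with only the input image, and with both image and instruction. The function name `combine_guidance` and the flat list-of-floats representation are illustrative assumptions, not the authors' code; a real implementation would operate on tensors.

```python
from typing import List

Vector = List[float]


def combine_guidance(
    eps_uncond: Vector,
    eps_image: Vector,
    eps_full: Vector,
    s_image: float,
    s_text: float,
) -> Vector:
    """Combine three noise predictions using two guidance scales.

    eps_uncond: prediction with image and instruction both dropped
    eps_image:  prediction conditioned on the input image only
    eps_full:   prediction conditioned on image and instruction
    s_image:    image guidance weight (sI), pulls toward the input image
    s_text:     text guidance weight (sT), pulls toward the instruction
    """
    if not (len(eps_uncond) == len(eps_image) == len(eps_full)):
        raise ValueError("all predictions must have the same length")
    return [
        u + s_image * (i - u) + s_text * (f - i)
        for u, i, f in zip(eps_uncond, eps_image, eps_full)
    ]


# With both scales at 1 the result reduces to the fully conditioned prediction.
plain = combine_guidance([0.0], [1.0], [3.0], s_image=1.0, s_text=1.0)
# Larger scales extrapolate past it along each conditioning direction.
guided = combine_guidance([0.0], [1.0], [3.0], s_image=1.5, s_text=7.5)
```

Raising `s_text` pushes the sample further along the instruction direction, while raising `s_image` keeps it closer to the input image, which is the trade-off the two weights are meant to control.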

Figure 2: Our method consists of two parts: generating an image editing dataset, and training a diffusion model on that dataset. (a) We first use a finetuned GPT-3 to generate instructions and edited captions. (b) We then use StableDiffusion [52] in combination with Prompt-to-Prompt [17] to gen

Experimental Results

Research Questions

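RQ1: Can a diffusion-based editor learn to follow natural-language editing instructions from synthetic multimodal training data?
RQ2: How do the two conditionings (input image and edit instruction), combined with classifier-free guidance, affect edit fidelity and instruction adherence?
RQ3: How do dataset size and filtering affect the model's ability to perform larger or more complex edits?
RQ4: How well does the model generalize to real images and human-written instructions at inference time?
RQ5: What are the main limitations and biases of instruction following in a data-driven, synthetic training pipeline?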

Key Findings

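The model generalizes zero-shot to real images and human-written instructions, without per-example fine-tuning.
A generated dataset of roughly 454k editing examples enables diverse edits, including style changes, background replacement, and object changes.
Two-condition classifier-free guidance balances fidelity to the input image against adherence to the instruction; sT of roughly 5–10 and sI of roughly 1–1.5 give strong results.
Compared with SDEdit and Text2Live, InstructPix2Pix produces clear instruction-guided edits while staying consistent with the input image.
Ablations show that more training data and CLIP filtering improve the ability to make larger edits while maintaining image consistency.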
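To make the CLIP-based filtering concrete, here is a minimal sketch of a directional similarity score: the cosine similarity between the change in image embeddings and the change in caption embeddings. It assumes the embeddings were already computed by some CLIP encoder and are passed in as plain lists of floats. The helper names and the `threshold` parameter are hypothetical and chosen for illustration; they are not taken from the paper's code.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# (image_before, image_after, caption_before, caption_after) embeddings
Example = Tuple[Vector, Vector, Vector, Vector]


def _difference(a: Vector, b: Vector) -> List[float]:
    """Element-wise a - b."""
    return [x - y for x, y in zip(a, b)]


def _cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def directional_similarity(example: Example) -> float:
    """How well the image change agrees with the caption change."""
    img_before, img_after, txt_before, txt_after = example
    return _cosine(
        _difference(img_after, img_before),
        _difference(txt_after, txt_before),
    )


def filter_pairs(examples: Sequence[Example], threshold: float) -> List[Example]:
    """Keep examples whose directional similarity meets the threshold."""
    return [ex for ex in examples if directional_similarity(ex) >= threshold]


# Toy 2-D embeddings: the first pair changes in the same direction as its
# captions, the second changes in the opposite direction.
aligned: Example = ([0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [2.0, 0.0])
opposed: Example = ([0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [-1.0, 0.0])
kept = filter_pairs([aligned, opposed], threshold=0.5)
```

A pair whose image change points the same way as its caption change scores near 1 and is kept, while a pair whose images changed in an unrelated or opposite direction scores low and is dropped.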

This review was generated by AI and reviewed by a human editor.