QUICK REVIEW

[Paper Review] DiffEdit: Diffusion-based semantic image editing with mask guidance

Guillaume Couairon, Jakob Verbeek|arXiv (Cornell University)|Oct 20, 2022

Generative Adversarial Networks and Image Synthesis | References: 54 | Cited by: 102

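One-Sentence Summary

DiffEdit automatically infers the region mask for text-guided semantic image editing, combining DDIM encoding with differences between diffusion model noise estimates to edit locally without a manual mask, and reports good results on ImageNet, COCO, and Imagen-generated images.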

ABSTRACT

Image generation has recently seen tremendous advances, with diffusion models allowing to synthesize convincing images for a large variety of text prompts. In this article, we propose DiffEdit, a method to take advantage of text-conditioned diffusion models for the task of semantic image editing, where the goal is to edit an image based on a text query. Semantic image editing is an extension of image generation, with the additional constraint that the generated image should be as similar as possible to a given input image. Current editing methods based on diffusion models usually require to provide a mask, making the task much easier by treating it as a conditional inpainting task. In contrast, our main contribution is able to automatically generate a mask highlighting regions of the input image that need to be edited, by contrasting predictions of a diffusion model conditioned on different text prompts. Moreover, we rely on latent inference to preserve content in those regions of interest and show excellent synergies with mask-based diffusion. DiffEdit achieves state-of-the-art editing performance on ImageNet. In addition, we evaluate semantic image editing in more challenging settings, using images from the COCO dataset as well as text-based generated images.

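Motivation and Objectives

Advance semantic image editing that applies a text-specified transformation while preserving as much of the input image as possible.
Remove the need for a user-provided mask by automatically inferring the edit region from predictions under different text conditions.
Use DDIM encoding to better preserve the input content within the edited region.
Combine mask guidance with conditional diffusion to produce high-quality, natural-looking edits.
Provide theoretical and empirical analysis showing advantages over prior diffusion-based editing methods.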

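Proposed Method

Using a text-conditioned diffusion model, compare the noise estimates obtained under the edit text Q and under a reference or empty text to infer the edit mask M.
Encode the input image into a latent x_r via DDIM encoding, using the unconditional model (no text).
Decode conditioned on the edit text Q, guided by the inferred mask: outside the mask, the decoded values are replaced with the encoded latents x_t, so the edit stays local.
Apply the mask-guided DDIM update y_t' = M y_t + (1 - M) x_t, where y_t is the decoded latent and x_t the encoded latent at step t. The encoding ratio r controls the edit strength and determines the number of denoising steps.
Compare, in theory, DiffEdit's DDIM-encoded editing with SDEdit's noise addition (Proposition 1), and explain why the bound is tighter when the unconditional and conditional noise estimates are close.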
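
The mask inference and the mask-guided update described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The function names (`infer_mask`, `blend`, `num_denoising_steps`), the nested-list representation of images, the 0.5 threshold, and the rounding rule for the step count are all assumptions made for illustration. The noise estimates are passed in as precomputed inputs rather than produced by a real diffusion model.

```python
from statistics import mean

def infer_mask(eps_query, eps_ref, threshold=0.5):
    """Hypothetical helper: build a binary edit mask from noise estimates.

    eps_query / eps_ref: lists of n noise estimates (each an H x W nested
    list), computed under the edit text Q and under the reference/empty text.
    """
    n = len(eps_query)
    h, w = len(eps_query[0]), len(eps_query[0][0])
    # Average the absolute difference over the n noise samples.
    diff = [[mean(abs(eps_query[k][i][j] - eps_ref[k][i][j]) for k in range(n))
             for j in range(w)] for i in range(h)]
    lo = min(min(row) for row in diff)
    hi = max(max(row) for row in diff)
    scale = (hi - lo) or 1.0  # avoid division by zero on a flat map
    # Rescale to [0, 1], then binarize (threshold value is an assumption).
    return [[1 if (v - lo) / scale >= threshold else 0 for v in row]
            for row in diff]

def blend(mask, y_t, x_t):
    """Mask-guided update: y_t' = M * y_t + (1 - M) * x_t, element-wise."""
    return [[m * y + (1 - m) * x for m, y, x in zip(mr, yr, xr)]
            for mr, yr, xr in zip(mask, y_t, x_t)]

def num_denoising_steps(r, total_steps):
    """Assumption: encoding up to ratio r of the schedule is undone by the
    same number of decoding steps."""
    return round(r * total_steps)

# Toy 2x2 example with a single noise sample (n = 1).
eps_q = [[[0.0, 1.0], [0.0, 0.2]]]
eps_r = [[[0.0, 0.0], [0.0, 0.0]]]
mask = infer_mask(eps_q, eps_r)
edited = blend(mask, y_t=[[9, 9], [9, 9]], x_t=[[1, 1], [1, 1]])
```

In the toy example only the top-right position exceeds the threshold, so `blend` takes the decoded value there and keeps the encoded latent everywhere else. A larger `r` passed to `num_denoising_steps` means more decoding steps and, per the summary above, a stronger edit.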

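Experimental Results

Research Questions

RQ1: Can contrasting predictions under different text prompts guide a diffusion model to edit only a local region, without a user-provided mask?
RQ2: Compared with simple noise addition, does encoding the input image with DDIM help preserve its appearance and integrate the edit seamlessly?
RQ3: When DDIM encoding is combined with the mask, what trade-offs arise between edit strength and fidelity to the original image?
RQ4: How does DiffEdit compare with prior diffusion-based editing methods on ImageNet, COCO, and Imagen-generated images?
RQ5: Does a reference text improve mask quality and editing results in practice?

Key Findings

DiffEdit achieves state-of-the-art editing performance on ImageNet relative to prior diffusion-based methods.
The inferred mask combined with DDIM encoding gives a better CSFID–LPIPS trade-off than SDEdit and other baselines on ImageNet, COCO, and Imagen-generated images.
Ablations show that the mask and DDIM encoding each improve results independently, and that combining them gives the best trade-off.
Computing the mask with a reference text (a caption of the original image) generally yields better edits, especially on the Imagen data, because changes concentrate in regions where the query and the reference disagree.
The theoretical analysis (Proposition 1) shows that, under realistic Lipschitz and boundedness assumptions, DDIM-encoded editing gives a tighter bound on the distance between the edit and the input image than SDEdit's simple noise addition.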

This review was generated by AI and reviewed by a human editor.