QUICK REVIEW

[Paper Review] InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions

Qian Wang, Biao Zhang|arXiv (Cornell University)|May 29, 2023

Multimodal Machine Learning Applications | Cited by 9

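One-Sentence Summary

InstructEdit combines a language processor with Grounded Segment Anything to generate high-quality masks guided by user instructions, enabling fine-grained editing of multi-object images and improving editing accuracy.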

ABSTRACT

Recent works have explored text-guided image editing using diffusion models and generated edited images based on text prompts. However, the models struggle to accurately locate the regions to be edited and faithfully perform precise edits. In this work, we propose a framework termed InstructEdit that can do fine-grained editing based on user instructions. Our proposed framework has three components: language processor, segmenter, and image editor. The first component, the language processor, processes the user instruction using a large language model. The goal of this processing is to parse the user instruction and output prompts for the segmenter and captions for the image editor. We adopt ChatGPT and optionally BLIP2 for this step. The second component, the segmenter, uses the segmentation prompt provided by the language processor. We employ a state-of-the-art segmentation framework Grounded Segment Anything to automatically generate a high-quality mask based on the segmentation prompt. The third component, the image editor, uses the captions from the language processor and the masks from the segmenter to compute the edited image. We adopt Stable Diffusion and the mask-guided generation from DiffEdit for this purpose. Experiments show that our method outperforms previous editing methods in fine-grained editing applications where the input image contains a complex object or multiple objects. We improve the mask quality over DiffEdit and thus improve the quality of edited images. We also show that our framework can accept multiple forms of user instructions as input. We provide the code at https://github.com/QianWangX/InstructEdit.

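Motivation and Goals

Enable fine-grained image editing from user instructions without hand-drawn masks.
Improve object localization and editing accuracy in multi-object images.
Automate the whole pipeline with pretrained language, segmentation, and diffusion models.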

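Proposed Method

Parse the user instruction with a large language model to produce a segmentation prompt and captions for the input and edited images.
Generate a high-quality mask from the segmentation prompt with Grounded Segment Anything (Grounded SAM).
Edit the image by combining the mask with a diffusion-based editor (Stable Diffusion with mask-guided DDIM), conditioned on the input and edited captions.
Encode the input image into a noise tensor with DDIM inversion, and control the editing strength through the encoding ratio r.
When the instruction is ambiguous, optionally use BLIP2 to describe the image and improve the prompts.
Evaluate editing quality with LPIPS, CLIP score, CLIP directional similarity, and a user study.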
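
The mask-guided editing steps above can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: the function names `encode_steps` and `masked_blend` are hypothetical, the latents are stand-in arrays, and the blend rule follows DiffEdit-style mask guidance, in which the edited latent is kept inside the mask and the source latent is restored outside it at each denoising step. Mapping r to a step count by rounding is likewise an assumption.

```python
import numpy as np


def encode_steps(total_steps: int, r: float) -> int:
    """Number of DDIM inversion steps for an encoding ratio r in [0, 1].

    Assumption: r is the fraction of the schedule used for inversion,
    so a larger r adds more noise and allows a stronger edit.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    return int(round(total_steps * r))


def masked_blend(edited_latent: np.ndarray,
                 source_latent: np.ndarray,
                 mask: np.ndarray) -> np.ndarray:
    """Keep the edit inside the mask and the source latent outside it."""
    mask = mask.astype(edited_latent.dtype)
    return mask * edited_latent + (1.0 - mask) * source_latent


# Toy usage: a 2x2 "latent" where only the masked pixels take the edit.
edited = np.ones((2, 2))
source = np.zeros((2, 2))
mask = np.array([[1, 0], [0, 1]])
blended = masked_blend(edited, source, mask)
```

In a real pipeline the blend would be applied to the latents at each denoising step, with the mask coming from the segmenter and the source latents coming from DDIM inversion of the input image.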

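Experimental Results

Research Questions

RQ1: Can user instructions be parsed effectively into prompts that drive segmentation and editing, without manual masks?
RQ2: Do grounding-based masks (Grounded SAM) improve fine-grained editing in multi-object images compared with mask-free baselines?
RQ3: How does instruction-driven editing perform in terms of semantic fidelity and instruction following in single-object and multi-object scenes?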

Key Findings

Method	LPIPS ↓	CLIP score ↑	CLIP directional similarity ↑
MDP-ε_t	0.214	26.414	0.079
InstructPix2Pix	0.290	25.844	0.114
DiffEdit	0.167	26.847	0.106
InstructEdit	0.121	27.404	0.082

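On the quantitative metrics, InstructEdit attains the best LPIPS (0.121) and CLIP score (27.404) among the compared methods, indicating better preservation of the input image and closer alignment with the target caption; its CLIP directional similarity (0.082) is lower than that of InstructPix2Pix (0.114) and DiffEdit (0.106).
InstructEdit improves mask quality over DiffEdit, which yields higher editing fidelity in complex scenes.
With Grounded SAM, the method localizes and edits the target object or region, reducing spill-over and localization errors.
BLIP2-assisted prompts improve editing quality when the user's description is vague or incomplete.
The user study shows that, across 10 test edits, users preferred InstructEdit over the baseline methods.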
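
For readers unfamiliar with the CLIP directional similarity column, a minimal sketch of the commonly used definition is given here. The summary does not spell out the formula, so this is an assumption: the metric is taken to be the cosine similarity between the change in image embeddings and the change in caption embeddings. The embeddings are assumed to be precomputed by a CLIP encoder, and the function names are hypothetical.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two non-zero 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def clip_directional_similarity(img_src: np.ndarray,
                                img_edit: np.ndarray,
                                txt_src: np.ndarray,
                                txt_edit: np.ndarray) -> float:
    """Agreement between the image-space and text-space edit directions.

    Assumption: all four inputs are CLIP embeddings of the input image,
    the edited image, the input caption, and the edited caption.
    """
    return cosine(img_edit - img_src, txt_edit - txt_src)


# Toy usage with 2-D stand-in embeddings: both edits move along the same axis.
score = clip_directional_similarity(
    np.array([0.0, 0.0]), np.array([1.0, 0.0]),
    np.array([0.0, 0.0]), np.array([2.0, 0.0]),
)
```

A higher value means the image changed in the direction the captions describe, which is a different question from how much of the input image was preserved (LPIPS) or how well the result matches the target caption (CLIP score).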


This review was generated by AI and reviewed by a human editor.