[论文解读] Inst-Inpaint: Instructing to Remove Objects with Diffusion Models
Inst-Inpaint 是一个基于扩散的框架,能够仅基于自然语言指令从图像中去除对象,而不需要用户绘制的掩模,并在 GQA-Inpaint 数据集上进行训练。
Image inpainting task refers to erasing unwanted pixels from images and filling them in a semantically consistent and realistic way. Traditionally, the pixels that are wished to be erased are defined with binary masks. From the application point of view, a user needs to generate the masks for the objects they would like to remove which can be time-consuming and prone to errors. In this work, we are interested in an image inpainting algorithm that estimates which object to be removed based on natural language input and removes it, simultaneously. For this purpose, first, we construct a dataset named GQA-Inpaint for this task. Second, we present a novel inpainting framework, Inst-Inpaint, that can remove objects from images based on the instructions given as text prompts. We set various GAN and diffusion-based baselines and run experiments on synthetic and real image datasets. We compare methods with different evaluation metrics that measure the quality and accuracy of the models and show significant quantitative and qualitative improvements.
研究动机与目标
- Motivate instructional image inpainting where removal targets are defined by text prompts rather than binary masks.
- Create a real-image dataset (GQA-Inpaint) with scene-graph–driven prompts for object removal.
- Develop a single-stage conditioned diffusion model (Inst-Inpaint) that removes objects from images using text prompts.
- Evaluate against baselines using metrics including a novel CLIP-based inpainting score.
- Demonstrate quantitative and qualitative improvements over state-of-the-art text-based inpainting methods.
提出的方法
- Propose an end-to-end latent diffusion model (Inst-Inpaint) that takes an image and a text instruction and removes the referenced object without explicit masks.
- Construct the GQA-Inpaint dataset by leveraging GQA scene graphs, obtaining segmentation masks via Detectron2 and Detic, and generating textual removal prompts from scene relations.
- Train two Inst-Inpaint models using a two-stage setup: a fixed first-stage encoder (VQGAN-based for GQA-Inpaint; KL-regularized autoencoder for CLEVR) and a second-stage conditional LDM that integrates the source image and text via concatenation and cross-attention.
- Use a latent diffusion objective on the latent codes with conditioning on the source image and text prompts.
- Apply pre-processing like CascadePSP refinement and mask dilation before inpainting with CRFill to produce target images in the dataset generation pipeline.

实验结果
研究问题
- RQ1Can an instruction-driven diffusion model remove objects from real images without requiring binary masks at inference?
- RQ2How well does a text-conditioned inpainting model perform compared to mask-based and other text-based baselines on real and synthetic datasets?
- RQ3Does CLIP-based inpainting scoring correlate with removal accuracy and realism in instruction-guided edits?
主要发现
| 方法 | FID ↓ | CLIP Dist. ↑ | CLIP Acc. ↑ | CLIP Acc. (top 5) ↑ | RelSim ↑ | Mask IoU ↑ |
|---|---|---|---|---|---|---|
| X-Decoder | 6.360 | 72.2 | 62.6 | 41.5 | - | - |
| InstPix2Pix | 9.972 | 56.8 | 33.5 | 11.8 | - | - |
| CLIPSeg | 8.048 | 71.7 | 57.4 | 33.5 | - | - |
| Inst-Inpaint (Ours) | 5.679 | 76.0 | 77.4 | 57.3 | - | - |
- Inst-Inpaint achieves lower FID scores and higher CLIP-based metrics than competing methods on GQA-Inpaint.
- The model demonstrates strong ability to attend to and remove the instructed object, as shown by attention maps focusing on the target object.
- On CLEVR, Inst-Inpaint outperforms GAN-based baselines in object-removal tasks, indicating effective instruction-conditioned editing even in synthetic settings.
- Compared to Instruct X-Decoder and ClipSeg, Inst-Inpaint yields better mask-prediction accuracy when using a simple UNet on attention maps (IoU scores shown in Table III).
- The approach yields qualitative and quantitative improvements, with the reported table of results indicating superior CLIP accuracy and relational consistency in many cases.
![Figure 2 : The proposed GQA-Inpaint dataset and our Inst-Inpaint method. Our work involves initially generating a dataset for the proposed instructional image inpainting task. To create input/output pairs, we utilize the images and their scene graphs that exist in the GQA dataset [ 18 ] . (a) We fir](https://ar5iv.labs.arxiv.org/html/2304.03246/assets/x2.png)
更好的研究,从现在开始
从论文设计到论文写作,大幅缩短您的研究时间。
无需绑定信用卡
本解读由 AI 生成,并经人工编辑审核。