QUICK REVIEW

[Paper Digest] ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation

Yasheng Sun, Y. F. Yang|arXiv (Cornell University)|Aug 2, 2023

Multimodal Machine Learning Applications | Cited by 10

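ONE-SENTENCE SUMMARY

ImageBrush proposes a diffusion-based framework that performs exemplar-based image manipulation from a pair of example images, which serve as the visual instruction, plus a query image, with no external language input; it casts the edit as progressive inpainting in latent space, aided by a visual prompt encoder and bounding-box prompts.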

ABSTRACT

While language-guided image manipulation has made remarkable progress, the challenge of how to instruct the manipulation process faithfully reflecting human intentions persists. An accurate and comprehensive description of a manipulation task using natural language is laborious and sometimes even impossible, primarily due to the inherent uncertainty and ambiguity present in linguistic expressions. Is it feasible to accomplish image manipulation without resorting to external cross-modal language information? If this possibility exists, the inherent modality gap would be effortlessly eliminated. In this paper, we propose a novel manipulation methodology, dubbed ImageBrush, that learns visual instructions for more accurate image editing. Our key idea is to employ a pair of transformation images as visual instructions, which not only precisely captures human intention but also facilitates accessibility in real-world scenarios. Capturing visual instructions is particularly challenging because it involves extracting the underlying intentions solely from visual demonstrations and then applying this operation to a new image. To address this challenge, we formulate visual instruction learning as a diffusion-based inpainting problem, where the contextual information is fully exploited through an iterative process of generation. A visual prompting encoder is carefully devised to enhance the model's capacity in uncovering human intent behind the visual instructions. Extensive experiments show that our method generates engaging manipulation results conforming to the transformations entailed in demonstrations. Moreover, our model exhibits robust generalization capabilities on various downstream tasks such as pose transfer, image translation and video inpainting.

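RESEARCH MOTIVATION AND GOALS

Achieve faithful image manipulation by learning visual instructions from example demonstrations, without cross-modal language input.
Develop a diffusion framework that understands the relations both within an example pair and between the examples and the query, and applies the edit to a new query image.
Remove the dependence on language prompts, narrowing the modality gap and improving accessibility in real-world scenarios.
Propose a visual prompt encoder and a bounding-box interaction mechanism to capture high-level human intent.
Demonstrate generalization to downstream tasks such as pose transfer, image translation, and video inpainting.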

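PROPOSED METHOD

Formulate exemplar-based manipulation as progressive inpainting over a grid-like input that concatenates E, E′, I, and a blank image M, iteratively recovering E, E′, I, I′.
Operate in latent space using a Latent Diffusion Model with a UNet backbone and cross-attention, through which the visual prompt context is injected.
Introduce a visual prompting module with a shared visual encoder e_v and a prompt encoder e_p that extracts high-level semantic context from the prompts; the resulting feature f_c is fused into the UNet via cross-attention at the middle block.
Introduce region-of-interest (ROI) prompts through a bounding-box encoder e_b with Fourier embeddings to produce grounding features; ROIs can be obtained automatically with GroundingDINO or drawn manually.
Use classifier-free guidance with a scale parameter to steer generation toward edits aligned with the instruction.
Adopt a bounding-box-based interface to capture the user's focus, enabling a richer understanding of human intent during instruction learning.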
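
The three mechanical pieces named above (the grid-like input, the Fourier embedding of a bounding box, and classifier-free guidance) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the 2x2 layout (E and E′ on top, I and the blank M below), the frequency schedule 2^k·π, and all function names are hypothetical choices made for this example, and images are toy nested lists rather than latents.

```python
import math


def make_grid(E, E_prime, I, blank_value=0.0):
    """Tile E, E', I and a blank canvas M into one 2H x 2W grid.

    Assumed layout (not confirmed by the source):
        [ E | E' ]
        [ I | M  ]
    Returns the grid and a binary mask that is 1 only over M,
    i.e. the quadrant the inpainting model has to fill with I'.
    """
    h, w = len(E), len(E[0])
    for img in (E_prime, I):
        if len(img) != h or any(len(row) != w for row in img):
            raise ValueError("all images must share the same size")
    M = [[blank_value] * w for _ in range(h)]
    top = [E[r] + E_prime[r] for r in range(h)]
    bottom = [I[r] + M[r] for r in range(h)]
    mask_top = [[0] * (2 * w) for _ in range(h)]
    mask_bottom = [[0] * w + [1] * w for _ in range(h)]
    return top + bottom, mask_top + mask_bottom


def fourier_box_embedding(box, num_freqs=4):
    """Embed a normalized box (x0, y0, x1, y1) with sin/cos features.

    The frequencies 2^k * pi are an assumption for illustration;
    the source only says that a Fourier embedding is used.
    """
    feats = []
    for coord in box:
        for k in range(num_freqs):
            angle = (2.0 ** k) * math.pi * coord
            feats.extend((math.sin(angle), math.cos(angle)))
    return feats


def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance mix of two noise predictions."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy 2x2 "images": constant values make the quadrants easy to see.
E = [[1.0, 1.0], [1.0, 1.0]]
E_prime = [[2.0, 2.0], [2.0, 2.0]]
I = [[3.0, 3.0], [3.0, 3.0]]
grid, mask = make_grid(E, E_prime, I)
box_features = fourier_box_embedding((0.25, 0.25, 0.75, 0.75))
guided = classifier_free_guidance([0.0, 0.5], [1.0, 0.5], scale=7.5)
```

With a guidance scale of 1 the mix reduces to the conditional prediction, and with a scale of 0 to the unconditional one; larger scales push the result further along the direction implied by the visual instruction.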

Figure 1: Demo results of the proposed ImageBrush framework on various image manipulation tasks. By providing a pair of task-specific examples and a new query image that share a similar context, ImageBrush accurately identifies the underlying task and generates the desired output.

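EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can image editing be guided by visual examples alone, faithfully reflecting user intent without relying on language signals?
RQ2: How can a diffusion model exploit in-context visual instructions to perform exemplar-driven edits on a new query image?
RQ3: What mechanisms are effective for encoding high-level semantics and user-specified regions into a visual prompting framework for image editing?
RQ4: Do exemplar-driven visual instructions generalize across tasks such as image translation, pose transfer, and video inpainting?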

KEY FINDINGS

Method	Scannet	LRW (Edge)	LRW (Mask)	UBC-Fashion	DAVIS
TSAM	-	-	-	-	86.84
CoCosNet	19.49	15.44	14.25	38.61	-
ImageBrush	9.18	9.67	8.95	12.99	18.70
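
To make the gaps in the table above easier to read, the relative reduction of ImageBrush's score against the listed baseline can be computed per dataset. This sketch assumes the metric is one where lower is better, which this digest does not state explicitly; the numbers are copied from the table, and the variable and function names are only illustrative.

```python
# (dataset, baseline name, baseline score, ImageBrush score), from the table.
rows = [
    ("Scannet", "CoCosNet", 19.49, 9.18),
    ("LRW (Edge)", "CoCosNet", 15.44, 9.67),
    ("LRW (Mask)", "CoCosNet", 14.25, 8.95),
    ("UBC-Fashion", "CoCosNet", 38.61, 12.99),
    ("DAVIS", "TSAM", 86.84, 18.70),
]


def relative_reduction(baseline, ours):
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


reductions = {
    dataset: relative_reduction(baseline, ours)
    for dataset, _, baseline, ours in rows
}

for dataset, name, _, _ in rows:
    print(f"{dataset}: {reductions[dataset]:.1%} lower than {name}")
```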

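ImageBrush's edits are consistent with the transformation shown in the example pair and with the context of the query.
The method generalizes robustly across examples on in-the-wild data for image translation, pose transfer, and video inpainting.
Diffusion-based inpainting with a progressive denoising process, together with the visual prompt encoder, improves context utilization and editing fidelity.
Combining visual prompts with bounding-box ROIs markedly improves adherence to human intent and region-focused editing.
On the multi-task in-the-wild benchmark, ImageBrush outperforms the baselines on direction-consistency and image-similarity metrics, and achieves competitive results with a single model across tasks.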

Figure 2: Illustration of ImageBrush. We introduce a novel and intuitive way of interacting with images. Users can easily manipulate images by providing a pair of examples and a query image as prompts to our system. If users wish to convey more precise instructions, they have the option to inform th

This digest was generated by AI and reviewed by a human editor.