QUICK REVIEW

[Paper Review] Image Translation as Diffusion Visual Programmers

Cheng Han, James C. Liang|arXiv (Cornell University)|Jan 18, 2024

Cell Image Analysis Techniques | Cited by 10

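One-sentence summary

DVP combines a condition-flexible diffusion model with GPT-driven visual programming, decomposing the task into RoI identification, editing, and positioning to achieve controllable, interpretable image translation; it delivers robust, high-fidelity translation without a hand-tuned guidance scale.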

ABSTRACT

We introduce the novel Diffusion Visual Programmer (DVP), a neuro-symbolic image translation framework. Our proposed DVP seamlessly embeds a condition-flexible diffusion model within the GPT architecture, orchestrating a coherent sequence of visual programs (i.e., computer vision models) for various pro-symbolic steps, which span RoI identification, style transfer, and position manipulation, facilitating transparent and controllable image translation processes. Extensive experiments demonstrate DVP's remarkable performance, surpassing concurrent arts. This success can be attributed to several key features of DVP: First, DVP achieves condition-flexible translation via instance normalization, enabling the model to eliminate sensitivity caused by the manual guidance and optimally focus on textual descriptions for high-quality content generation. Second, the framework enhances in-context reasoning by deciphering intricate high-dimensional concepts in feature spaces into more accessible low-dimensional symbols (e.g., [Prompt], [RoI object]), allowing for localized, context-free editing while maintaining overall coherence. Last but not least, DVP improves systemic controllability and explainability by offering explicit symbolic representations at each programming stage, empowering users to intuitively interpret and modify results. Our research marks a substantial step towards harmonizing artificial image translation processes with cognitive intelligence, promising broader applications.

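Motivation and goals

Translate images by locating regions of interest (RoIs) and applying targeted style/content changes while preserving context.
Introduce a condition-flexible diffusion model that reduces reliance on a hand-tuned guidance scale.
Decompose high-dimensional concepts into low-dimensional symbols through in-context reasoning in visual programming.
Provide explicit intermediate symbols and a step-by-step execution flow for controllability and explainability.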

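Proposed method

Embed a condition-flexible diffusion model within GPT to plan a sequence of image-editing programs.
Use instance normalization guidance to decouple the unconditional and conditional predictions and remove the dependence on a hand-tuned guidance scale.
Introduce cross-attention that links image features to the text prompt, enabling spatially controllable editing.
Define in-context editability with symbols such as [Prompt], [RoI object], and [Scenario] to enable context-free editing.
Implement a GPT-driven Planner with the operations GPlan, PG (Prompter), Segment, Inpaint, and PM (Position Manipulator).
Use a Compiler to map variables to values and run the operations step by step, producing interpretable intermediate outputs.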
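
The Planner/Compiler flow described above can be illustrated with a toy interpreter. This is only a minimal sketch under stated assumptions, not the authors' implementation: the operation names Segment, Inpaint, and PM come from the review, while `Step`, `run_program`, the variable names, and the string-based stand-ins for images and masks are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    op: str      # operation name, e.g. "Segment"
    args: tuple  # names of the input variables
    out: str     # name of the variable that receives the result


# Stand-in operations (hypothetical): strings replace real images and masks.
def segment(image, roi_object):
    return f"mask({roi_object} in {image})"


def inpaint(image, mask, prompt):
    return f"inpaint({image}, {mask}, '{prompt}')"


def position_manipulate(image, direction):
    return f"move({image}, {direction})"


OPS = {"Segment": segment, "Inpaint": inpaint, "PM": position_manipulate}


def run_program(steps, env):
    """Resolve variable names to values and execute the steps in order.

    Returns the final environment plus a trace of every intermediate
    output, which is what makes each stage inspectable.
    """
    env = dict(env)  # copy so the caller's mapping is left untouched
    trace = []
    for step in steps:
        values = [env[name] for name in step.args]
        env[step.out] = OPS[step.op](*values)
        trace.append((step.op, step.out, env[step.out]))
    return env, trace


program = [
    Step("Segment", ("IMAGE", "ROI"), "MASK"),
    Step("Inpaint", ("IMAGE", "MASK", "PROMPT"), "EDITED"),
    Step("PM", ("EDITED", "DIRECTION"), "RESULT"),
]
env, trace = run_program(
    program,
    {"IMAGE": "img0", "ROI": "dog", "PROMPT": "a cat", "DIRECTION": "left"},
)
for op, name, value in trace:
    print(f"{op:8s} -> {name} = {value}")
```

Because every step writes to a named variable, a user could in principle inspect or overwrite an intermediate value (for example the mask) before the next step runs, which is the kind of controllability the method list refers to.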

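Experimental results

Research questions

RQ1: How can diffusion-based image translation be made condition-flexible without relying on a hand-tuned guidance scale?
RQ2: Can a neuro-symbolic, visual-programming approach achieve precise RoI-focused editing while preserving global coherence?
RQ3: Do explicit symbolic intermediate representations improve the controllability and explainability of image translation?
RQ4: Does decoupling high-dimensional concepts into low-dimensional symbols during in-context reasoning help achieve context-free editing?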

Key findings

Method	Quality	Fidelity	Diversity	CLIP-Score	DINO-Score
VQGAN-CLIP	3.25	3.16	3.29	0.749	0.667
Text2Live	3.55	3.45	3.73	0.785	0.659
SDEDIT	3.37	3.46	3.32	0.754	0.642
Prompt2Prompt	3.82	3.92	3.48	0.825	0.657
DiffuseIT	3.88	3.87	3.57	0.804	0.648
VISPROG	3.86	4.04	3.44	0.813	0.651
DVP (ours)	3.95	4.28	3.56	0.839	0.697
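
To read the table at a glance, the short script below picks the top-scoring method in each column and the lead over the runner-up. The numbers are copied from the table; the English column labels and the variable names are illustrative choices made for this sketch.

```python
columns = ["Quality", "Fidelity", "Diversity", "CLIP-Score", "DINO-Score"]
scores = {
    "VQGAN-CLIP":    [3.25, 3.16, 3.29, 0.749, 0.667],
    "Text2Live":     [3.55, 3.45, 3.73, 0.785, 0.659],
    "SDEDIT":        [3.37, 3.46, 3.32, 0.754, 0.642],
    "Prompt2Prompt": [3.82, 3.92, 3.48, 0.825, 0.657],
    "DiffuseIT":     [3.88, 3.87, 3.57, 0.804, 0.648],
    "VISPROG":       [3.86, 4.04, 3.44, 0.813, 0.651],
    "DVP (ours)":    [3.95, 4.28, 3.56, 0.839, 0.697],
}

best = {}
margin = {}
for i, col in enumerate(columns):
    # Sort methods by this column, highest score first.
    ranked = sorted(scores, key=lambda m: scores[m][i], reverse=True)
    best[col] = ranked[0]
    margin[col] = round(scores[ranked[0]][i] - scores[ranked[1]][i], 3)

for col in columns:
    print(f"{col:11s} best: {best[col]:13s} lead over runner-up: {margin[col]}")
```

On these numbers DVP ranks first in every column except diversity, where Text2Live scores highest.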

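DVP surpasses state-of-the-art baselines in fidelity and quality across diverse prompts.
Instance normalization guidance stabilizes translation and reduces sensitivity to the guidance scale.
In-context visual programming enables localized, controllable edits, with explicit intermediate symbols that improve transparency.
Annotations generated by the Prompter improve labeling efficiency and final image quality.
DVP shows strong RoI-focused translation while preserving the background context.
The user study and the CLIP/DINO metrics show higher fidelity and quality than competing methods.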
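
The finding about instance normalization guidance can be made concrete with a small numerical sketch. The review does not reproduce the paper's formula, so the AdaIN-style form below is an assumption: it normalizes the conditional noise prediction per channel and re-expresses it with the unconditional prediction's statistics. The function names, array shapes, and the `tiny` constant are hypothetical; standard classifier-free guidance is included only for contrast.

```python
import numpy as np


def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: requires a hand-picked `scale`."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def instance_norm_guidance(eps_uncond, eps_cond, tiny=1e-8):
    """Assumed AdaIN-style combination with no guidance scale.

    Both inputs are (C, H, W) arrays; statistics are taken per channel
    over the two spatial axes.
    """
    axes = (-2, -1)
    mu_c = eps_cond.mean(axis=axes, keepdims=True)
    sd_c = eps_cond.std(axis=axes, keepdims=True)
    mu_u = eps_uncond.mean(axis=axes, keepdims=True)
    sd_u = eps_uncond.std(axis=axes, keepdims=True)
    normalized = (eps_cond - mu_c) / (sd_c + tiny)
    return sd_u * normalized + mu_u


rng = np.random.default_rng(0)
eps_uncond = rng.normal(size=(4, 8, 8))
eps_cond = rng.normal(loc=0.3, scale=1.5, size=(4, 8, 8))

guided = instance_norm_guidance(eps_uncond, eps_cond)
# Rescaling the conditional prediction leaves the result (almost) unchanged,
# whereas classifier-free guidance changes with every choice of `scale`.
rescaled = instance_norm_guidance(eps_uncond, 7.5 * eps_cond)
print(np.allclose(guided, rescaled, atol=1e-6))
```

In this assumed form the output keeps the spatial pattern of the conditional prediction and the per-channel mean and standard deviation of the unconditional one, so no scale needs tuning.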


This review was generated by AI and checked by a human editor.