QUICK REVIEW

[Paper Review] Image Translation as Diffusion Visual Programmers

Cheng Han, James C. Liang|arXiv (Cornell University)|Jan 18, 2024

Cell Image Analysis Techniques | Cited by 10

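Summary in Brief

DVP combines a condition-flexible diffusion model with GPT-driven visual programming and decomposes image translation into steps for identifying, editing, and positioning regions of interest (RoIs), which makes the process controllable and explainable. It achieves robust, high-fidelity translation without relying on a manually tuned guidance scale.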

ABSTRACT

We introduce the novel Diffusion Visual Programmer (DVP), a neuro-symbolic image translation framework. Our proposed DVP seamlessly embeds a condition-flexible diffusion model within the GPT architecture, orchestrating a coherent sequence of visual programs (i.e., computer vision models) for various pro-symbolic steps, which span RoI identification, style transfer, and position manipulation, facilitating transparent and controllable image translation processes. Extensive experiments demonstrate DVP's remarkable performance, surpassing concurrent arts. This success can be attributed to several key features of DVP: First, DVP achieves condition-flexible translation via instance normalization, enabling the model to eliminate sensitivity caused by the manual guidance and optimally focus on textual descriptions for high-quality content generation. Second, the framework enhances in-context reasoning by deciphering intricate high-dimensional concepts in feature spaces into more accessible low-dimensional symbols (e.g., [Prompt], [RoI object]), allowing for localized, context-free editing while maintaining overall coherence. Last but not least, DVP improves systemic controllability and explainability by offering explicit symbolic representations at each programming stage, empowering users to intuitively interpret and modify results. Our research marks a substantial step towards harmonizing artificial image translation processes with cognitive intelligence, promising broader applications.

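Motivation and Objectives

Translate images by identifying regions of interest (RoIs) and applying targeted style or content changes while preserving context.
Introduce a condition-flexible diffusion model that reduces dependence on a manually tuned guidance scale.
Enable in-context reasoning through visual programming by decomposing high-dimensional concepts into low-dimensional symbols.
Provide explicit intermediate symbols and a step-by-step execution flow for controllability and explainability.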

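Proposed Method

Embed a condition-flexible diffusion model within GPT, which plans how the image-editing programs are coordinated.
Use instance normalization guidance to decouple the unconditional and conditional predictions, removing the dependence on a manually tuned guidance scale.
Incorporate cross-attention linking image features and text prompts for spatially controllable editing.
Define in-context visual programming with symbols such as [Prompt], [RoI object], and [Scenario], enabling context-free editing.
Implement a GPT-driven planner equipped with operations such as GPlan, PG (Prompter), Segment, Inpaint, and PM (Position Manipulator).
Run programs through a Compiler that maps variables to values and executes operations step by step, producing explainable intermediate outputs.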
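The step-by-step execution described in the last item can be illustrated with a toy interpreter. This is a minimal sketch, not the authors' Compiler: the `Step` record, the `run_program` function, the variable names, and the argument lists of each operation are hypothetical, and every operation is a stub that returns a descriptive string instead of calling a vision model. Only the operation names PG, Segment, and Inpaint come from the review.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    out: str                # variable that receives the result
    op: str                 # operation name, e.g. "Segment"
    args: tuple[str, ...]   # names of the variables passed to the operation


# Stub operations (hypothetical signatures): strings stand in for images and masks.
OPS: dict[str, Callable[..., str]] = {
    "PG": lambda image: f"prompt_for({image})",
    "Segment": lambda image, roi: f"mask_of({roi})",
    "Inpaint": lambda image, mask, prompt: f"edited({image}, {mask}, {prompt})",
}


def run_program(steps: list[Step], env: dict[str, str]):
    """Execute steps in order, binding each result to a variable.

    Returns the final variable table and a trace of (variable, value)
    pairs, so every intermediate output can be inspected.
    """
    env = dict(env)  # copy, so the caller's table is not mutated
    trace = []
    for step in steps:
        values = [env[name] for name in step.args]  # resolve variables to values
        env[step.out] = OPS[step.op](*values)
        trace.append((step.out, env[step.out]))
    return env, trace


# An illustrative three-step program over a hypothetical input image.
PROGRAM = [
    Step("PROMPT", "PG", ("IMAGE",)),
    Step("MASK", "Segment", ("IMAGE", "ROI")),
    Step("RESULT", "Inpaint", ("IMAGE", "MASK", "PROMPT")),
]

if __name__ == "__main__":
    _, trace = run_program(PROGRAM, {"IMAGE": "input.png", "ROI": "dog"})
    for name, value in trace:
        print(f"{name} = {value}")
```

Because each step writes to a named variable, a user could swap one intermediate value (for example, a different mask) and rerun only the later steps, which is the kind of control the explicit symbols are meant to provide.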

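Experimental Results

Research Questions

RQ1: How can diffusion-based image translation be made condition-flexible without a manual guidance scale?
RQ2: Can a neuro-symbolic, visual-programming approach enable precise RoI-oriented editing while maintaining global coherence?
RQ3: Do explicit symbolic intermediate representations improve the controllability and explainability of image translation?
RQ4: Can in-context reasoning decompose high-dimensional concepts into low-dimensional symbols to support context-free editing?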

Key Findings

Method	Quality	Fidelity	Diversity	CLIP-Score	DINO-Score
VQGAN-CLIP	3.25	3.16	3.29	0.749	0.667
Text2Live	3.55	3.45	3.73	0.785	0.659
SDEDIT	3.37	3.46	3.32	0.754	0.642
Prompt2Prompt	3.82	3.92	3.48	0.825	0.657
DiffuseIT	3.88	3.87	3.57	0.804	0.648
VISPROG	3.86	4.04	3.44	0.813	0.651
DVP (ours)	3.95	4.28	3.56	0.839	0.697
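
To read the table at a glance, the margin between DVP and the strongest baseline can be computed per column. The sketch below copies the scores from the table above and assumes that higher is better for every metric; the `margins` helper is introduced here for illustration only.

```python
METRICS = ["Quality", "Fidelity", "Diversity", "CLIP-Score", "DINO-Score"]

# Scores copied from the table above, one row per method.
SCORES = {
    "VQGAN-CLIP": [3.25, 3.16, 3.29, 0.749, 0.667],
    "Text2Live": [3.55, 3.45, 3.73, 0.785, 0.659],
    "SDEDIT": [3.37, 3.46, 3.32, 0.754, 0.642],
    "Prompt2Prompt": [3.82, 3.92, 3.48, 0.825, 0.657],
    "DiffuseIT": [3.88, 3.87, 3.57, 0.804, 0.648],
    "VISPROG": [3.86, 4.04, 3.44, 0.813, 0.651],
    "DVP (ours)": [3.95, 4.28, 3.56, 0.839, 0.697],
}


def margins(scores, ours="DVP (ours)"):
    """For each metric, return (best baseline, ours minus that baseline)."""
    result = {}
    for i, metric in enumerate(METRICS):
        best_name, best_row = max(
            ((name, row) for name, row in scores.items() if name != ours),
            key=lambda item: item[1][i],
        )
        result[metric] = (best_name, round(scores[ours][i] - best_row[i], 3))
    return result


if __name__ == "__main__":
    for metric, (baseline, delta) in margins(SCORES).items():
        print(f"{metric}: {delta:+.3f} vs {baseline}")
```

Under that assumption, DVP leads on Quality (+0.07 over DiffuseIT), Fidelity (+0.24 over VISPROG), CLIP-Score (+0.014 over Prompt2Prompt), and DINO-Score (+0.03 over VQGAN-CLIP), while on Diversity it trails Text2Live by 0.17.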

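DVP outperforms state-of-the-art baselines in fidelity and quality across diverse prompts.
Instance normalization guidance stabilizes translation and eliminates sensitivity to the guidance scale.
In-context visual programming enables localized, controllable editing, with explicit intermediate symbols providing transparency.
Prompter-generated annotations improve label efficiency and final image quality.
DVP demonstrates strong RoI-oriented translation while preserving the background context.
The user study and the CLIP/DINO metrics show higher fidelity and quality than competing methods.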
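
The claim about guidance-scale sensitivity can be made concrete by contrasting two ways of combining a diffusion model's unconditional and conditional noise predictions. The sketch below is an assumption-laden illustration, not the paper's exact formulation: `classifier_free_guidance` is the standard scaled combination, and `instance_norm_guidance` is an AdaIN-style stand-in that normalizes each conditional channel and then rescales it with the unconditional channel's mean and standard deviation. Both function names and the list-based channel layout are hypothetical.

```python
from statistics import fmean, pstdev

Channel = list[float]  # one feature channel with spatial positions flattened


def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Standard combination, shown for contrast: it needs a hand-tuned scale."""
    return [
        [u + scale * (c - u) for u, c in zip(ch_u, ch_c)]
        for ch_u, ch_c in zip(eps_uncond, eps_cond)
    ]


def instance_norm_guidance(eps_uncond, eps_cond, tiny=1e-8):
    """AdaIN-style combination (assumed form, no scale parameter).

    Each conditional channel is normalized to zero mean and unit variance,
    then rescaled and shifted with the unconditional channel's statistics.
    """
    out = []
    for ch_u, ch_c in zip(eps_uncond, eps_cond):
        mu_c, sigma_c = fmean(ch_c), pstdev(ch_c)
        mu_u, sigma_u = fmean(ch_u), pstdev(ch_u)
        out.append(
            [sigma_u * (c - mu_c) / (sigma_c + tiny) + mu_u for c in ch_c]
        )
    return out


if __name__ == "__main__":
    eps_uncond = [[0.0, 2.0]]   # a single channel with two positions
    eps_cond = [[10.0, 30.0]]
    print(classifier_free_guidance(eps_uncond, eps_cond, scale=7.5))
    print(instance_norm_guidance(eps_uncond, eps_cond))
```

In the scaled version the output magnitude grows with `scale`, so the result depends on how that value is tuned. In the normalized version the output keeps the unconditional prediction's per-channel statistics regardless of how large the conditional prediction is, which is one way to see why no manual scale is needed.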


This review was generated by AI and checked by a human editor.