QUICK REVIEW

[Paper Review] IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Ye Hu, Jun Zhang|arXiv (Cornell University)|Aug 13, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 109

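One-line summary

IP-Adapter is a lightweight image prompt adapter for pretrained text-to-image diffusion models, built on decoupled cross-attention; with about 22M parameters it adds image prompt capability and generalizes well.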

ABSTRACT

Recent years have witnessed the strong power of large text-to-image diffusion models for the impressive generative capability to create high-fidelity images. However, it is very tricky to generate desired images using only text prompt as it often involves complex prompt engineering. An alternative to text prompt is image prompt, as the saying goes: "an image is worth a thousand words". Although existing methods of direct fine-tuning from pretrained models are effective, they require large computing resources and are not compatible with other base models, text prompt, and structural controls. In this paper, we present IP-Adapter, an effective and lightweight adapter to achieve image prompt capability for the pretrained text-to-image diffusion models. The key design of our IP-Adapter is decoupled cross-attention mechanism that separates cross-attention layers for text features and image features. Despite the simplicity of our method, an IP-Adapter with only 22M parameters can achieve comparable or even better performance to a fully fine-tuned image prompt model. As we freeze the pretrained diffusion model, the proposed IP-Adapter can be generalized not only to other custom models fine-tuned from the same base model, but also to controllable generation using existing controllable tools. With the benefit of the decoupled cross-attention strategy, the image prompt can also work well with the text prompt to achieve multimodal image generation. The project page is available at https://ip-adapter.github.io.

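Motivation and objectives

Enable image-prompted generation without fine-tuning the underlying diffusion model.
Design a lightweight adapter that incorporates image prompts while preserving the model's text-to-image capability.
Achieve strong generalization to custom models and ensure compatibility with controllable-generation tools such as ControlNet.
Demonstrate multimodal prompting that combines image prompts with text prompts.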

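Proposed method

Add an image encoder (the CLIP image encoder) to produce a global image embedding.
Introduce decoupled cross-attention: alongside each cross-attention layer in the UNet, add a new cross-attention layer for image features with its own learnable K/V projections.
Initialize the image K/V projections from the text K/V projections to speed up convergence, and train only the adapter parameters (about 22M in total).
Train with the same diffusion objective L_simple, conditioned on the text features c_t and the image features c_i. Randomly drop the image and text prompts during training to enable classifier-free guidance.
At inference, optionally adjust the relative weight of image guidance and text guidance with the parameter lambda to balance the two prompts in multimodal generation.
Demonstrate compatibility with existing controllable adapters (e.g., ControlNet) without modifying the underlying diffusion model.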
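
The decoupled cross-attention step can be illustrated with a toy NumPy sketch. It is not the authors' implementation: it assumes a single attention head, no batch dimension, random weights, and arbitrary toy sizes, and the function names (`attention`, `decoupled_cross_attention`) are hypothetical. It also assumes the image branch is added to the text branch with weight lambda, i.e. Z_new = Attn(Q, K, V) + lambda * Attn(Q, K', V'), where both branches share the same query.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention for one head, no batch dimension.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def decoupled_cross_attention(z, c_t, c_i, w_q, w_k, w_v, w_k_img, w_v_img, lam=1.0):
    # One shared query; separate K/V projections for text and image features.
    q = z @ w_q
    text_out = attention(q, c_t @ w_k, c_t @ w_v)           # frozen text branch
    image_out = attention(q, c_i @ w_k_img, c_i @ w_v_img)  # new image branch
    return text_out + lam * image_out

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_cond = 8, 6
z = rng.normal(size=(16, d_model))    # latent (query) tokens
c_t = rng.normal(size=(10, d_cond))   # text features
c_i = rng.normal(size=(4, d_cond))    # image features
w_q = rng.normal(size=(d_model, d_model))
w_k = rng.normal(size=(d_cond, d_model))
w_v = rng.normal(size=(d_cond, d_model))
# Image K/V projections start as copies of the text K/V projections.
w_k_img, w_v_img = w_k.copy(), w_v.copy()

out = decoupled_cross_attention(z, c_t, c_i, w_q, w_k, w_v, w_k_img, w_v_img, lam=0.5)
text_only = decoupled_cross_attention(z, c_t, c_i, w_q, w_k, w_v, w_k_img, w_v_img, lam=0.0)
print(out.shape)
```

Setting `lam=0.0` removes the image branch, so the layer reduces to the original text cross-attention. In this sketch only `w_k_img` and `w_v_img` would be trained; `w_q`, `w_k`, and `w_v` stand in for the frozen base-model weights.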

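Experimental results

Research questions

RQ1: Can image prompts be integrated into a pretrained text-to-image diffusion model without fine-tuning the base model?
RQ2: Does the decoupled cross-attention design improve image-prompt fidelity over simple feature concatenation and other adapters?
RQ3: Can IP-Adapter be reused across custom models derived from the same base model, and is it compatible with existing control tools?
RQ4: Can image prompts be combined effectively with text prompts for multimodal generation?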

Key findings

Method	Reusable to custom models	Compatible with controllable tools	Multimodal prompts	Trainable parameters	CLIP-T	CLIP-I
Open unCLIP	✗	✗	✗	893M	0.608	0.858
Kandinsky-2-1	✗	✗	✗	1229M	0.599	0.855
Versatile Diffusion	✗	✗	✓	860M	0.587	0.830
SD Image Variations	✗	✗	✗	860M	0.548	0.760
SD unCLIP	✗	✗	✗	870M	0.584	0.810
Uni-ControlNet (Global Control)	✓	✓	✓	47M	0.506	0.736
T2I-Adapter (Style)	✓	✓	✓	39M	0.485	0.648
ControlNet Shuffle	✓	✓	✓	361M	0.421	0.616
IP-Adapter	✓	✓	✓	22M	0.588	0.828

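With only 22M parameters, IP-Adapter matches or outperforms several fully fine-tuned image prompt models; in the table above it scores higher than SD Image Variations and SD unCLIP on both CLIP-T and CLIP-I.
The decoupled cross-attention design outperforms a simple adapter that only concatenates image features into the cross-attention input.
IP-Adapter can be reused with custom models derived from the same base model and remains compatible with controllable tools such as ControlNet.
The method supports multimodal prompts, allowing a balanced use of image and text conditions during generation.
On the CLIP-T and CLIP-I metrics, the quantitative results on COCO show that IP-Adapter outperforms the other adapters in the table (Uni-ControlNet, T2I-Adapter, ControlNet Shuffle) and is on par with or better than several fine-tuned baselines.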
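
To make the parameter-efficiency comparison concrete, the short Python sketch below recomputes a few ratios from the table above. The numbers are copied from the table; the helper name `compare` and the "beaten on both metrics" criterion are illustrative assumptions, not part of the paper.

```python
# (trainable parameters in millions, CLIP-T, CLIP-I), copied from the table above
fine_tuned = {
    "Open unCLIP": (893, 0.608, 0.858),
    "Kandinsky-2-1": (1229, 0.599, 0.855),
    "Versatile Diffusion": (860, 0.587, 0.830),
    "SD Image Variations": (860, 0.548, 0.760),
    "SD unCLIP": (870, 0.584, 0.810),
}
ip_adapter = (22, 0.588, 0.828)

def compare(baselines, ours):
    """Return parameter ratio and metric gaps of `ours` against each baseline."""
    rows = {}
    for name, (params, clip_t, clip_i) in baselines.items():
        rows[name] = {
            "param_ratio": ours[0] / params,
            "clip_t_gap": round(ours[1] - clip_t, 3),
            "clip_i_gap": round(ours[2] - clip_i, 3),
        }
    return rows

rows = compare(fine_tuned, ip_adapter)

# Fine-tuned baselines that IP-Adapter exceeds on both metrics.
beaten_on_both = [
    name for name, r in rows.items()
    if r["clip_t_gap"] > 0 and r["clip_i_gap"] > 0
]

for name, r in rows.items():
    print(f"{name}: {r['param_ratio']:.1%} of params, "
          f"CLIP-T {r['clip_t_gap']:+.3f}, CLIP-I {r['clip_i_gap']:+.3f}")
print("Beaten on both metrics:", beaten_on_both)
```

In every row IP-Adapter trains under 3% of the baseline's parameter count, while the metric gaps to the strongest fine-tuned models (Open unCLIP, Kandinsky-2-1) stay within about 0.03.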


This review was generated by AI and checked by a human editor.