QUICK REVIEW

[Paper Review] Training-Free Layout Control with Cross-Attention Guidance

Minghao Chen, Iro Laina|arXiv (Cornell University)|Apr 6, 2023

Generative Adversarial Networks and Image Synthesis | Citations: 8

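One-Sentence Summary

This paper achieves training-free layout control for diffusion models by manipulating cross-attention, and shows that backward guidance, which optimizes a loss measuring how well the generated layout agrees with user-specified boxes, outperforms forward guidance.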

ABSTRACT

Recent diffusion-based generators can produce high-quality images from textual prompts. However, they often disregard textual instructions that specify the spatial layout of the composition. We propose a simple approach that achieves robust layout control without the need for training or fine-tuning of the image generator. Our technique manipulates the cross-attention layers that the model uses to interface textual and visual information and steers the generation in the desired direction given, e.g., a user-specified layout. To determine how to best guide attention, we study the role of attention maps and explore two alternative strategies, forward and backward guidance. We thoroughly evaluate our approach on three benchmarks and provide several qualitative examples and a comparative analysis of the two strategies that demonstrate the superiority of backward guidance compared to forward guidance, as well as prior work. We further demonstrate the versatility of layout guidance by extending it to applications such as editing the layout and context of real images.

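Motivation and Objectives

Achieve robust spatial layout control in text-to-image generation without training or fine-tuning the generator.
Investigate how cross-attention maps influence layout, and compare forward guidance with backward guidance.
Develop a training-free mechanism that steers the layout using user-specified bounding boxes.
Demonstrate applicability to layout editing of real images, and integrate the method with personalization pipelines.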

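Proposed Method

Formulate layout control as sampling from p(x | y, B, i), where y is the text prompt and B is the bounding box assigned to token i.
Examine the cross-attention layers γ, whose maps A^{(γ)}_{ui} link spatial location u to text token i.
Formalize forward guidance, which biases the cross-attention map with a window function g^{(γ)}_{u}.
Define an energy function E(A^{(γ)}, B, i) that encourages attention to concentrate inside B, and update the latent z_t by backpropagation: z_t ← z_t − σ_t^2 η ∇_{z_t} Σ_γ E(A^{(γ)}, B, i).
Backward guidance differs from forward guidance in that updating the latent indirectly aligns the attention of all tokens, whereas forward guidance directly biases the attention of a single token.
Evaluate on three benchmarks, and analyze the role of individual tokens, including the start and padding tokens, as well as the influence of the initial diffusion noise.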
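As a rough illustration of forward guidance, the sketch below blends one token's attention map with a box-shaped window while keeping the map's total mass fixed. The blending rule, the strength `lam`, the uniform window, and all function names are assumptions made for illustration; this summary does not give the exact form of g^{(γ)}_{u}.

```python
import numpy as np

def box_window(height, width, box):
    """Uniform window over box = (x0, y0, x1, y1), normalized to sum to 1.

    Coordinates are grid indices; x1 and y1 are exclusive (an assumed convention).
    """
    x0, y0, x1, y1 = box
    g = np.zeros((height, width))
    g[y0:y1, x0:x1] = 1.0
    return g / g.sum()

def forward_guidance(attn, window, lam):
    """Blend one token's attention map with the window (assumed rule).

    lam = 0 leaves the map unchanged; lam = 1 moves all mass into the box.
    Multiplying the window by attn.sum() keeps the total mass constant.
    """
    return (1.0 - lam) * attn + lam * window * attn.sum()

def mass_inside(attn, window):
    """Fraction of the attention mass that falls inside the window's support."""
    return attn[window > 0].sum() / attn.sum()

# Toy 8x8 attention map for a single token (random values, not model output).
rng = np.random.default_rng(0)
attn = rng.random((8, 8))
window = box_window(8, 8, box=(1, 1, 4, 4))
biased = forward_guidance(attn, window, lam=0.8)
print("mass inside box before:", mass_inside(attn, window))
print("mass inside box after: ", mass_inside(biased, window))
```

In a real pipeline the same function would be applied per layer and per guided token; which tokens to bias and how strongly are choices this sketch leaves open.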

Figure 2 : Overview of the two layout guidance strategies. The cross-attention map for a chosen word token is marked with a red border. In forward guidance, the cross-attention maps of the word, start and padding tokens are biased spatially. In backward guidance, we compute instead a loss function a
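The backward-guidance loop can be sketched on a toy problem. Everything here beyond the update rule z_t ← z_t − σ_t^2 η ∇E is an assumption: the energy is taken to be the squared fraction of attention mass falling outside the box, and a softmax over the latent stands in for a real cross-attention layer, so the gradient can be written by hand rather than backpropagated through a denoiser.

```python
import numpy as np

def box_mask(height, width, box):
    """Binary mask for box = (x0, y0, x1, y1); x1 and y1 are exclusive."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width))
    mask[y0:y1, x0:x1] = 1.0
    return mask

def layout_energy(attn, mask):
    """Assumed energy: (1 - mass inside box / total mass)^2, zero when all mass is inside."""
    return (1.0 - (attn * mask).sum() / attn.sum()) ** 2

def toy_attention(z):
    """Hypothetical stand-in for a cross-attention map: softmax of the latent over positions."""
    e = np.exp(z - z.max())
    return e / e.sum()

def energy_grad(z, mask):
    """Analytic gradient of layout_energy(toy_attention(z), mask) with respect to z.

    With S = sum_u mask_u * a_u and a = softmax(z): dS/dz_u = a_u * (mask_u - S),
    so dE/dz_u = -2 * (1 - S) * a_u * (mask_u - S).
    """
    attn = toy_attention(z)
    inside = (attn * mask).sum()
    return -2.0 * (1.0 - inside) * attn * (mask - inside)

def backward_guidance_step(z, mask, sigma_t, eta):
    """One latent update: z <- z - sigma_t^2 * eta * grad_z E."""
    return z - sigma_t ** 2 * eta * energy_grad(z, mask)

# Toy run: start from a flat latent and take repeated guidance steps.
mask = box_mask(8, 8, box=(1, 1, 4, 4))
z = np.zeros((8, 8))
print("energy before:", layout_energy(toy_attention(z), mask))
for _ in range(100):
    z = backward_guidance_step(z, mask, sigma_t=1.0, eta=1.0)
print("energy after: ", layout_energy(toy_attention(z), mask))
```

With an actual diffusion model, the gradient would instead come from automatic differentiation through the denoiser's cross-attention layers, and the energy would be summed over the selected layers γ.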

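Experimental Results

Research Questions

RQ1: How can layout-conditioned image generation be achieved without retraining a pretrained diffusion model?
RQ2: Is backward guidance more effective than forward guidance at imposing a spatial layout through cross-attention?
RQ3: How well does training-free layout guidance integrate with real-image editing and personalization methods?
RQ4: How much do the main factors that shape layout (tokens, initial noise) matter during diffusion generation?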

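Key Findings

Backward guidance achieves higher object accuracy (OA) and conditional VISOR scores than forward guidance.
Backward guidance combined with noise selection substantially improves OA and VISOR scores across benchmarks.
The cross-attention maps of the start and padding tokens carry meaningful layout information, which helps the guidance strategies.
On COCO and Flickr30K, backward guidance outperforms other layout-conditioned methods in mean average precision (mAP) and AP@0.5.
Combined with Textual Inversion or Dreambooth, the approach enables layout editing of real images, preserving identity while controlling layout.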
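For readers unfamiliar with the AP@0.5 metric mentioned in the findings, the sketch below shows the box-overlap test that underlies it: a detected box counts as matching the requested box when their intersection-over-union (IoU) is at least 0.5. The function names and the (x0, y0, x1, y1) box convention are assumptions for illustration, and the sketch does not reproduce the paper's evaluation pipeline (the detector and averaging procedure are not described in this summary).

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap region, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def matches_layout(detected_box, target_box, threshold=0.5):
    """True when the detected box overlaps the target box at or above the IoU threshold."""
    return iou(detected_box, target_box) >= threshold

target = (0, 0, 10, 10)
print(matches_layout((2, 0, 12, 10), target))  # shifted by 2: IoU = 80 / 120
print(matches_layout((5, 0, 15, 10), target))  # shifted by 5: IoU = 50 / 150
```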

Figure 3 : Cross-attention maps during forward and backward guidance. Spatial dependencies between different words negatively affect forward guidance, while backward guidance softly encourages all dependent tokens to match the desired layout.

This review was generated by AI and checked by a human editor.