QUICK REVIEW

[論文レビュー] FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention

Guangxuan Xiao, Tianwei Yin|arXiv (Cornell University)|May 17, 2023

Generative Adversarial Networks and Image Synthesis被引用数 21

ひとこと要約

FastComposer は局所的なクロスアテンションと遅延条件付けを用いた、微調整不要の被写体特化・複数被写体画像生成を実現し、新規被写体に対して大幅な速度アップと追加ストレージ不要を達成します。

ABSTRACT

Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient due to the subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation as they often blend features among subjects. We present FastComposer which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions with only forward passes. To address the identity blending problem in the multi-subject generation, FastComposer proposes cross-attention localization supervision during training, enforcing the attention of reference subjects localized to the correct regions in the target images. Naively conditioning on subject embeddings results in subject overfitting. FastComposer proposes delayed subject conditioning in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves 300$ imes$-2500$ imes$ speedup compared to fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available at https://github.com/mit-han-lab/fastcomposer.

研究の動機と目的

参照画像から導出された被写体埋め込みを用いて個別化生成を行い、被写体ごとのファインチューニングを置換する。
トレーニング時のクロスアテンション局在化を通じて、複数被写体生成時のアイデンティティの混合を軽減する。
推論時に遅延した被写体条件付けを用いることで、テキスト主導の編集を可能にしつつ被写体の一意性を維持する。
複数の未知被写体に対して、推論のみで高忠実度かつ効率的な生成を可能にする。

提案手法

画像エンコーダとMLPを用いて参照画像から導出した被写体埋め込みを用い、視覚特徴とテキスト埋め込みを融合してテキストプロンプトを拡張する。
被写体強化条件付けと、クロスアテンションマップを参照セグメンテーションマスクと整列させる局在化損失を用いて、アイデンティティの混合を防ぐように学習する。
訓練中に被写体のセグメンテーションマスクに似せるよう指導して、クロスアテンションマップを局在化する。
ノイズ除去過程で遅延した被写体条件付けを導入し、初期段階でテキストのみの条件付けを適用してレイアウトを確立し、その後プロセスの後半で被写体条件付けによる細精を行う。
FFHQ からパノプティックマスクと名詞句対応を備えた被写体強化の画像-テキストデータセットを構築し、被写体背景をマスクした状態で Stable Diffusion ベースのモデルを学習させることで一般化を改善する。

Figure 1 : Comparison with baselines for multi-subject image generation. We use scientists’ names in the text prompt for text-only methods (SD, MJ). Text-only methods only perform well when subjects are present in the training dataset but struggle to maintain the identity otherwise. Fine-tuning-base

実験結果

リサーチクエスチョン

RQ1ファインチューニングなしで被写体埋め込みを用いた個別化生成は可能か？
RQ2クロスアテンション局在化は複数被写体生成におけるアイデンティティの混合を減らすか？
RQ3推論時の被写体条件付けを遅延させると、アイデンティティを保持しつつ編集可能性を維持できるか？
RQ4ファインチューニングベースの方法と比較して、推論のみの複数被写体生成の効率向上はどの程度か？

主な発見

方法	画像	アイデンティティ保持	プロンプトの一貫性	総時間	ピークメモリ
StableDiffusion	0	1.88%	28.44%	2 s	6 GB
Textual-Inversion	5	13.52%	21.08%	4998 s	17 GB
Custom Diffusion	5	5.37%	25.84%	789 s	29 GB
FastComposer	1	43.11%	24.25%	2 s	6 GB
StableDiffusion (multi)	0	1.88%	28.44%	2 s	6 GB
Textual-Inversion (multi)	5	13.52%	21.08%	4998 s	17 GB
Custom Diffusion (multi)	5	5.37%	25.84%	789 s	29 GB
FastComposer (multi)	1	43.11%	24.25%	2 s	6 GB

FastComposer はファインチューニングベースの方法と比較して300×から1200×の速度向上と2.8×のメモリ削減を実現。
新規被写体には追加ストレージが不要。
Single-subject generation with FastComposer attains higher identity preservation and competitive prompt consistency versus baselines.
Multi-subject generation maintains higher identity preservation than tuning-based baselines, with prompt consistency comparable to them.
Cross-attention localization improves identity separation among subjects (ablation shows lower identity preservation without localization).
Delayed subject conditioning balances identity preservation with editability, with optimum results when conditioning ratio lies roughly between 0.6 and 0.8.

Figure 2 : Two challenges faced by existing subject-driven image generation methods. Firstly, current methods blend the distinct characteristics of different subjects ( identity blending ), shown by the right figure where Newton resembles Einstein. Cross-attention localization (Sec 4.2 ) solves this

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。