QUICK REVIEW

[Paper Review] StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing

Senmao Li, Joost van de Weijer|arXiv (Cornell University)|Mar 28, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 12

One-Line Summary

StyleDiffusion learns a prompt embedding from a real image and applies edits through the value path of cross-attention. Combined with attention regularization and the P2Plus editing scheme, which also uses the unconditional branch, it achieves more accurate edits.

ABSTRACT

A significant research effort is focused on exploiting the amazing capacities of pretrained diffusion models for the editing of images. They either finetune the model, or invert the image in the latent space of the pretrained model. However, they suffer from two problems: (1) Unsatisfying results for selected regions and unexpected changes in non-selected regions. (2) They require careful text prompt editing where the prompt should include all visual objects in the input image. To address this, we propose two improvements: (1) Only optimizing the input of the value linear network in the cross-attention layers is sufficiently powerful to reconstruct a real image. (2) We propose attention regularization to preserve the object-like attention maps after reconstruction and editing, enabling us to obtain accurate style editing without invoking significant structural changes. We further improve the editing technique that is used for the unconditional branch of classifier-free guidance as used by P2P. Extensive experimental prompt-editing results on a variety of images demonstrate qualitatively and quantitatively that our method has superior editing capabilities compared to existing and concurrent works. See our accompanying code in Stylediffusion: https://github.com/sen-mao/StyleDiffusion.

Motivation and Objectives

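Enable accurate text-driven editing of real images with diffusion models, without elaborate prompt design or full model fine-tuning.
Map the input image to the value branch (as a prompt embedding) while keeping the key branch fixed so that the attention maps are preserved.
Train the mapping network M_t with both a reconstruction loss and an attention loss (L_rec + L_att).
Propose P2Plus, which replaces the self-attention maps in both the conditional and unconditional branches to handle large structural edits (the injection timestep τ_u is tunable).
Show empirically, on both qualitative and quantitative measures, better editing accuracy and structure preservation than the baselines.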
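
To make the key/value split concrete, the following is a minimal single-head cross-attention sketch in plain Python. It assumes the keys are projected from a frozen text embedding and the values from a separately learned embedding; the names `cross_attention`, `c_key`, `c_value`, the toy matrices, and the identity projections are illustrative assumptions and are not taken from the StyleDiffusion code.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def cross_attention(queries, c_key, c_value, w_k, w_v):
    """Single-head cross-attention with separate inputs for keys and values.

    queries : (pixels x d) image features, assumed already projected
    c_key   : (tokens x e) frozen text embedding; it alone fixes the attention map
    c_value : (tokens x e) learned embedding; it changes only what gets mixed in
    """
    keys = matmul(c_key, w_k)        # (tokens x d)
    values = matmul(c_value, w_v)    # (tokens x d)
    d = len(keys[0])
    keys_t = [list(col) for col in zip(*keys)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(queries, keys_t)]
    attn = softmax_rows(scores)      # depends on queries and c_key only
    return attn, matmul(attn, values)

# Toy data (hypothetical): 3 "pixels", 2 "tokens", feature size 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c_key = [[2.0, 0.0], [0.0, 2.0]]           # frozen
c_value_a = [[1.0, 0.0], [0.0, 1.0]]
c_value_b = [[0.5, 0.5], [-1.0, 3.0]]      # a different learned embedding

attn_a, out_a = cross_attention(queries, c_key, c_value_a, identity, identity)
attn_b, out_b = cross_attention(queries, c_key, c_value_b, identity, identity)
```

In this toy setup, swapping `c_value_a` for `c_value_b` changes the attention output but leaves the attention map untouched, which is the property the value-only optimization relies on.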

Proposed Method

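DDIM inversion serves as the starting point, providing latent codes and attention maps for the real image.
The input image is mapped to a prompt embedding that feeds the value stream of cross-attention, while the key embedding is kept frozen.
The mapping network M_t is trained with a reconstruction loss and an attention loss that align the reconstructed latents and attention maps with those obtained from inversion (L_rec + L_att).
For more faithful structural edits, P2Plus replaces the self-attention maps in both the conditional and unconditional branches (with a tunable injection timestep τ_u).
P2P-style prompt-to-prompt guidance (with the unconditional-branch extension) improves object-level edits while preserving background structure.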
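
The role of the injection timestep can be sketched as a simple per-step schedule. This is only an illustration under stated assumptions: step 0 is taken to be the noisiest denoising step, source self-attention maps are injected while the step index is below a branch threshold, and the function name `p2plus_injection_plan` and the conditional-branch threshold `tau_c` are hypothetical. Only τ_u (here `tau_u`) is named in the summary above, and the paper may define its direction differently.

```python
def p2plus_injection_plan(num_steps, tau_c, tau_u):
    """Return, per denoising step, which branches receive source self-attention maps.

    Assumption: step 0 is the noisiest step, and maps are injected while the
    step index is below the threshold of the respective branch.
    """
    if not (0 <= tau_c <= num_steps and 0 <= tau_u <= num_steps):
        raise ValueError("thresholds must lie in [0, num_steps]")
    plan = []
    for step in range(num_steps):
        plan.append({
            "step": step,
            "conditional": step < tau_c,     # P2P-style injection
            "unconditional": step < tau_u,   # additional injection in P2Plus
        })
    return plan

# Hypothetical settings: 10 steps, conditional injection for 6, unconditional for 3.
plan = p2plus_injection_plan(num_steps=10, tau_c=6, tau_u=3)

# With tau_u = 0 the schedule touches only the conditional branch.
conditional_only = p2plus_injection_plan(num_steps=10, tau_c=6, tau_u=0)
```

Under these assumptions, raising `tau_u` keeps the unconditional branch tied to the source structure for more steps, while `tau_u=0` falls back to injecting in the conditional branch only.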

Experimental Results

Research Questions

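RQ1: When editing real images, how can degradation of unedited regions and the need for elaborate prompt design be avoided?
RQ2: Can restricting edits to the value path of cross-attention preserve structure while still enabling the targeted style edit?
RQ3: Does incorporating unconditional-branch attention through P2Plus improve large structural edits compared with P2P?
RQ4: Do attention regularization and DDIM-based inversion yield better reconstruction and editability than existing inversion methods?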

Key Findings

Method	Structure-dist ↓	NS-LPIPS ↓	Clipscore ↑
DDIM	0.092	0.4131	81.9 %
SDEdit	0.046	0.2473	78.0 %
Null-text	0.027	0.1480	75.2 %
Ours	0.026	0.1165	77.9 %

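StyleDiffusion improves reconstruction and editing accuracy over the baselines on both qualitative and quantitative measures.
Attention regularization improves reconstruction fidelity and aligns the cross-attention maps with those from DDIM inversion.
P2Plus editing, which also injects self-attention into the unconditional branch, handles large structural changes better than P2P.
On a dataset of 100 images, StyleDiffusion achieves the best Structure-dist and NS-LPIPS scores, and its Clipscore is competitive with the baselines.
It maintains high PSNR/SSIM in reconstruction while adding only a modest inference-time overhead.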
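
As a quick check on how large the gaps in the results table are, the short script below compares the "Ours" row with the Null-text row (the closest baseline on the two lower-is-better metrics) and with the DDIM Clipscore. The numbers are copied from the table; the variable and function names are illustrative.

```python
# Values copied from the results table in this review.
null_text = {"structure_dist": 0.027, "ns_lpips": 0.1480, "clipscore": 75.2}
ours = {"structure_dist": 0.026, "ns_lpips": 0.1165, "clipscore": 77.9}
ddim_clipscore = 81.9

def relative_reduction(baseline, value):
    """Fractional reduction of a lower-is-better metric relative to a baseline."""
    return (baseline - value) / baseline

structure_gain = relative_reduction(null_text["structure_dist"], ours["structure_dist"])
lpips_gain = relative_reduction(null_text["ns_lpips"], ours["ns_lpips"])
clip_vs_null_text = ours["clipscore"] - null_text["clipscore"]   # percentage points
clip_vs_ddim = ours["clipscore"] - ddim_clipscore                # percentage points

print(f"Structure-dist reduction vs Null-text: {structure_gain:.1%}")
print(f"NS-LPIPS reduction vs Null-text:       {lpips_gain:.1%}")
print(f"Clipscore difference vs Null-text:     {clip_vs_null_text:+.1f} points")
print(f"Clipscore difference vs DDIM:          {clip_vs_ddim:+.1f} points")
```

The Structure-dist margin over Null-text is small (about 3.7%), the NS-LPIPS margin is larger (about 21.3%), and Clipscore sits 2.7 points above Null-text but 4.0 points below DDIM, which is why the summary calls it competitive and not best.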

This review was generated by AI and checked by a human editor.