QUICK REVIEW

[Paper Review] Enhancing Detail Preservation for Customized Text-to-Image Generation: A Regularization-Free Approach

Yufan Zhou, Ruiyi Zhang | arXiv (Cornell University) | May 23, 2023

Video Analysis and Summarization | Cited by 8

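One-line summary

ProFusion introduces a regularization-free framework, built on PromptNet and Fusion Sampling, that customizes a large-scale text-to-image model from a single image, preserving fine-grained details while speeding up training.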

ABSTRACT

Recent text-to-image generation models have demonstrated impressive capability of generating text-aligned images with high fidelity. However, generating images of novel concept provided by the user input image is still a challenging task. To address this problem, researchers have been exploring various methods for customizing pre-trained text-to-image generation models. Currently, most existing methods for customizing pre-trained text-to-image generation models involve the use of regularization techniques to prevent over-fitting. While regularization will ease the challenge of customization and leads to successful content creation with respect to text guidance, it may restrict the model capability, resulting in the loss of detailed information and inferior performance. In this work, we propose a novel framework for customized text-to-image generation without the use of regularization. Specifically, our proposed framework consists of an encoder network and a novel sampling method which can tackle the over-fitting problem without the use of regularization. With the proposed framework, we are able to customize a large-scale text-to-image generation model within half a minute on single GPU, with only one image provided by the user. We demonstrate in experiments that our proposed framework outperforms existing methods, and preserves more fine-grained details.

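Motivation and Objectives

Motivate personalized customization of large-scale text-to-image models without the detail loss caused by regularization.
Propose an encoder (PromptNet) that maps an input image to a text embedding compatible with Stable Diffusion 2.
Introduce Fusion Sampling, which jointly conditions on the image-derived embedding and the user's text at inference time.
Show that the method preserves fine-grained details and improves prompt adherence compared with baseline methods.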

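Proposed Method

PromptNet encodes the input image into a word embedding, S*, in the text-encoder space of Stable Diffusion 2.
Fusion Sampling modifies classifier-free sampling to condition on both S* and the user text C, using a staged process (fusion and an optional refinement).
In the fusion stage, x_t is updated using gradients of the log-probability terms for S* and C, forming a Langevin-like step.
The refinement stage can be used to further integrate the fused information and improve image quality.
Updates are computed with DDIM-based sampling equations to ensure alignment with the prompt.
No regularization is applied while training PromptNet, which preserves image details.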
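
The two-condition guidance and the DDIM update described above can be sketched as follows. This is a minimal illustration, not the paper's exact update rule. The function names, the guidance weights `w_s` and `w_c`, and the way the three noise predictions are combined are all assumptions. The combination is motivated by the identity log p(x | S*, C) = log p(x) + [log p(x | S*) - log p(x)] + [log p(x | S*, C) - log p(x | S*)]. The DDIM step is the standard deterministic form (eta = 0). Plain Python lists stand in for image tensors.

```python
import math


def fused_noise(eps_uncond, eps_s, eps_sc, w_s=5.0, w_c=5.0):
    """Combine three noise predictions into one guided prediction.

    eps_uncond: prediction with no condition
    eps_s:      prediction conditioned on S* only
    eps_sc:     prediction conditioned on both S* and C
    w_s, w_c:   hypothetical guidance weights (not taken from the paper)
    """
    return [
        u + w_s * (s - u) + w_c * (sc - s)
        for u, s, sc in zip(eps_uncond, eps_s, eps_sc)
    ]


def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """Standard deterministic DDIM update (eta = 0) from step t to t-1."""
    a_t, a_prev = math.sqrt(alpha_bar_t), math.sqrt(alpha_bar_prev)
    s_t, s_prev = math.sqrt(1.0 - alpha_bar_t), math.sqrt(1.0 - alpha_bar_prev)
    out = []
    for x, e in zip(x_t, eps):
        x0_pred = (x - s_t * e) / a_t  # current estimate of the clean sample
        out.append(a_prev * x0_pred + s_prev * e)
    return out


# Toy usage: two "pixels", made-up noise predictions and noise schedule.
eps = fused_noise([0.0, 1.0], [1.0, 2.0], [3.0, 0.0], w_s=2.0, w_c=2.0)
x_prev = ddim_step([0.5, -0.5], eps, alpha_bar_t=0.5, alpha_bar_prev=0.7)
```

With `w_s = w_c = 1` the combination reduces to the fully conditioned prediction `eps_sc`; larger weights push the sample harder toward each condition.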

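Experimental Results

Research Questions

RQ1: Is customized text-to-image generation possible without regularization while preserving fine-grained details?
RQ2: During diffusion sampling, how can information from the user-provided image-derived embedding S* and an arbitrary text C be fused so that the prompt is satisfied without over-fitting?
RQ3: How do the fusion and refinement stages affect detail preservation and structural consistency?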

Key Findings

Method	ViT-B/32	ViT-B/16	ViT-L/14	ViT-L/14@336px	RN101	RN50	RN50×4	RN50×16	RN50×64
Stable Diffusion 2	0.271	0.256	0.196	0.196	0.428	0.202	0.355	0.254	0.181
Textual Inversion	0.257	0.251	0.197	0.201	0.426	0.195	0.350	0.247	0.173
DreamBooth	0.283	0.267	0.205	0.210	0.434	0.209	0.363	0.260	0.187
E4T	0.277	0.264	0.203	0.213	0.429	0.206	0.358	0.260	0.191
ProFusion (Ours)	0.293	0.283	0.225	0.229	0.446	0.223	0.374	0.279	0.202
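
The size of the gap in the table can be checked with a little arithmetic. The sketch below copies the scores from the table and reports, for each CLIP backbone, the strongest baseline and ProFusion's margin over it. The variable names are illustrative only.

```python
clip_models = ["ViT-B/32", "ViT-B/16", "ViT-L/14", "ViT-L/14@336px",
               "RN101", "RN50", "RN50×4", "RN50×16", "RN50×64"]

# Scores copied from the table above, one list per method.
scores = {
    "Stable Diffusion 2": [0.271, 0.256, 0.196, 0.196, 0.428, 0.202, 0.355, 0.254, 0.181],
    "Textual Inversion":  [0.257, 0.251, 0.197, 0.201, 0.426, 0.195, 0.350, 0.247, 0.173],
    "DreamBooth":         [0.283, 0.267, 0.205, 0.210, 0.434, 0.209, 0.363, 0.260, 0.187],
    "E4T":                [0.277, 0.264, 0.203, 0.213, 0.429, 0.206, 0.358, 0.260, 0.191],
    "ProFusion (Ours)":   [0.293, 0.283, 0.225, 0.229, 0.446, 0.223, 0.374, 0.279, 0.202],
}

ours = "ProFusion (Ours)"
baselines = [name for name in scores if name != ours]

margins = {}
for i, model in enumerate(clip_models):
    # Strongest baseline for this backbone (first one wins on ties).
    best = max(baselines, key=lambda name: scores[name][i])
    margins[model] = (best, round(scores[ours][i] - scores[best][i], 3))

for model, (best, margin) in margins.items():
    print(f"{model}: best baseline = {best}, ProFusion margin = {margin:+.3f}")
```

On these numbers ProFusion leads on every backbone, with margins between 0.010 and 0.020 over the strongest baseline, which is DreamBooth on most backbones and E4T on ViT-L/14@336px and RN50×64.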

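ProFusion achieves higher image-prompt similarity than the baselines across multiple CLIP models, indicating stronger prompt adherence.
ProFusion improves identity similarity between generated and input images under several face recognition models.
Human evaluators in an MTurk study preferred ProFusion over the baseline methods.
Fusion Sampling consistently outperforms vanilla classifier-free sampling at both preserving input details and satisfying the target prompt.
Data augmentation during fine-tuning further improves performance.
Ablation studies show that both fusion and refinement contribute significantly to quality.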
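
The image-prompt and identity similarity scores above compare embedding vectors. The review does not state the exact formula, so the sketch below assumes cosine similarity, the usual choice for CLIP-style and face-recognition embeddings. The function name and the toy vectors are illustrative; real embeddings would come from the respective encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for an image embedding and a prompt embedding.
image_embedding = [3.0, 4.0]
prompt_embedding = [4.0, 3.0]
score = cosine_similarity(image_embedding, prompt_embedding)
```

Under this assumption a score is bounded by 1, and only differences between methods measured with the same encoder are meaningful.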


This review was generated by AI and checked by a human editor.