QUICK REVIEW

[Paper Review] Subject-driven Text-to-Image Generation via Apprenticeship Learning

Wenhu Chen, Hexiang Hu|arXiv (Cornell University)|Apr 1, 2023

Multimodal Machine Learning Applications|Cited by 46

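One-line summary

SuTI trains a single apprentice diffusion model to imitate a large number of subject-specific expert models, enabling instant in-context subject-driven image generation with no test-time fine-tuning. It achieves high fidelity at high speed and outperforms DreamBooth on several metrics.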

ABSTRACT

Recent text-to-image generation models like DreamBooth have made remarkable progress in generating highly customized images of a target subject, by fine-tuning an "expert model" for a given subject from a few examples. However, this process is expensive, since a new expert model must be learned for each subject. In this paper, we present SuTI, a Subject-driven Text-to-Image generator that replaces subject-specific fine tuning with in-context learning. Given a few demonstrations of a new subject, SuTI can instantly generate novel renditions of the subject in different scenes, without any subject-specific optimization. SuTI is powered by apprenticeship learning, where a single apprentice model is learned from data generated by a massive number of subject-specific expert models. Specifically, we mine millions of image clusters from the Internet, each centered around a specific visual subject. We adopt these clusters to train a massive number of expert models, each specializing in a different subject. The apprentice model SuTI then learns to imitate the behavior of these fine-tuned experts. SuTI can generate high-quality and customized subject-specific images 20x faster than optimization-based SoTA methods. On the challenging DreamBench and DreamBench-v2, our human evaluation shows that SuTI significantly outperforms existing models like InstructPix2Pix, Textual Inversion, Imagic, Prompt2Prompt, Re-Imagen and DreamBooth, especially on the subject and text alignment aspects.

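Motivation and objectives

Achieve efficient, scalable subject-driven image generation without per-subject fine-tuning.
Use apprenticeship learning so that a single apprentice model imitates a large population of expert models.
Enable in-context generation of unseen subjects and compositions from a small number of demonstrations.
Evaluate on DreamBench and DreamBench-v2 with both automatic metrics and human evaluation.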

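Proposed method

Train a large number of subject-specific expert diffusion models on mined image-text clusters.
Synthesize pseudo-targets from the experts' outputs and use them to train a single apprentice diffusion model.
Apply Delta CLIP filtering so that only high-quality expert outputs are used to train the apprentice.
At inference, generate new images from 3-5 in-context demonstrations without any optimization.
Scale training through distributed, TPU-based parallel fine-tuning of the experts and the apprentice.
Compare against baselines using DINO, CLIP-I, and CLIP-T scores together with human evaluation.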
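
The filtering step in the list above can be illustrated with a small sketch. This is not the authors' implementation. It assumes that the Delta CLIP score measures how much an expert's output improves text alignment over the demonstration images it was derived from, and that embeddings have already been computed by some CLIP-like encoder. The names `delta_clip` and `filter_expert_outputs` and the threshold value are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# (target text embedding, demonstration image embeddings, expert output embedding)
Candidate = Tuple[Vector, List[Vector], Vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def delta_clip(target_text: Vector, demo_images: List[Vector],
               expert_output: Vector) -> float:
    """ASSUMED definition: text alignment of the expert output minus the
    text alignment of the best-aligned demonstration image."""
    baseline = max(cosine(img, target_text) for img in demo_images)
    return cosine(expert_output, target_text) - baseline


def filter_expert_outputs(candidates: List[Candidate],
                          threshold: float) -> List[Candidate]:
    """Keep only candidates whose assumed Delta CLIP score meets the threshold."""
    return [c for c in candidates if delta_clip(*c) >= threshold]


if __name__ == "__main__":
    text = [1.0, 0.0]
    demos = [[0.0, 1.0], [1.0, 1.0]]
    good = (text, demos, [1.0, 0.0])   # output matches the text better than any demo
    bad = (text, demos, [0.0, 1.0])    # output matches the text worse than the demos
    kept = filter_expert_outputs([good, bad], threshold=0.1)
    print(len(kept))
```

Raising `threshold` keeps fewer but better-aligned pseudo-targets, which is the trade-off between training-set size and quality that the review describes.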

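Experimental results

Research questions

RQ1: Can a single apprentice diffusion model generalize to unseen subjects and compositions without test-time fine-tuning?
RQ2: How does the number of in-context demonstrations affect subject fidelity and text alignment?
RQ3: How does data quality filtering (Delta CLIP) affect final generation performance?
RQ4: How does SuTI compare with DreamBooth and other subject-driven methods on DreamBench and DreamBench-v2?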

Key findings

Method	Backbone	DINO ↑	CLIP-I ↑	CLIP-T ↑
Real Image (Oracle)	-	0.774	0.885	-
DreamBooth	Imagen (1)	0.696	0.812	0.306
DreamBooth	SD (21)	0.668	0.803	0.305
Textual Inversion	SD (21)	0.569	0.780	0.255
Re-Imagen	Imagen (1)	0.600	0.740	0.270
Ours: SuTI	Imagen (1)	0.741	0.819	0.304
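
For readers unfamiliar with the three automatic metrics, the sketch below shows how they are commonly computed in the DreamBooth evaluation protocol: DINO and CLIP-I are average pairwise cosine similarities between embeddings of generated and real subject images (they differ only in the encoder), and CLIP-T is the average cosine similarity between the prompt embedding and the generated image embeddings. The embeddings are assumed to be precomputed, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def subject_fidelity(generated: List[Vector], real: List[Vector]) -> float:
    """Average pairwise cosine similarity between generated and real images.

    With DINO embeddings this gives the DINO score; with CLIP image
    embeddings it gives CLIP-I.
    """
    sims = [cosine(g, r) for g in generated for r in real]
    return sum(sims) / len(sims)


def text_alignment(generated: List[Vector], prompt: Vector) -> float:
    """Average cosine similarity between the prompt and generated images (CLIP-T)."""
    sims = [cosine(g, prompt) for g in generated]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy 2-D embeddings stand in for real encoder outputs.
    generated = [[1.0, 0.0], [0.0, 1.0]]
    real = [[1.0, 0.0]]
    print(subject_fidelity(generated, real))
    print(text_alignment(generated, [1.0, 0.0]))
```

Higher is better for all three scores, which is what the arrows in the table header indicate.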

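SuTI generates images of unseen subjects instantly and in context from 3-5 demonstrations, with no per-subject optimization.
On DreamBench, SuTI scores 0.741 on DINO, 0.819 on CLIP-I, and 0.304 on CLIP-T. It outperforms DreamBooth on DINO and CLIP-I and is roughly on par on CLIP-T (0.304 vs. 0.306).
In human evaluation on DreamBench-v2, SuTI outperforms DreamBooth by 5% overall and the other baselines by at least 30%.
The quality of Delta CLIP filtering has a decisive effect on performance: higher thresholds improve human evaluation scores even though they leave a smaller training set.
Dream-SuTI (SuTI fine-tuned on the subject's images) improves quality further, reaching a higher overall score than both SuTI and DreamBooth.
At inference, SuTI takes roughly 20 seconds per subject using 3-5 demonstrations and needs less memory than most fine-tuning-based methods.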
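
The DreamBench comparison can be checked against the table with a few lines of arithmetic. The sketch below copies the table's numbers (the Real Image oracle row is omitted because it has no CLIP-T value). The dictionary layout and the helper names `margins` and `best_method` are only for illustration.

```python
from typing import Dict

# Scores copied from the DreamBench table above.
RESULTS: Dict[str, Dict[str, float]] = {
    "DreamBooth (Imagen)": {"DINO": 0.696, "CLIP-I": 0.812, "CLIP-T": 0.306},
    "DreamBooth (SD)": {"DINO": 0.668, "CLIP-I": 0.803, "CLIP-T": 0.305},
    "Textual Inversion (SD)": {"DINO": 0.569, "CLIP-I": 0.780, "CLIP-T": 0.255},
    "Re-Imagen (Imagen)": {"DINO": 0.600, "CLIP-I": 0.740, "CLIP-T": 0.270},
    "SuTI (Imagen)": {"DINO": 0.741, "CLIP-I": 0.819, "CLIP-T": 0.304},
}


def margins(method: str, baseline: str) -> Dict[str, float]:
    """Per-metric difference (method minus baseline), rounded to 3 decimals."""
    return {
        metric: round(RESULTS[method][metric] - RESULTS[baseline][metric], 3)
        for metric in RESULTS[method]
    }


def best_method(metric: str) -> str:
    """Name of the method with the highest score on the given metric."""
    return max(RESULTS, key=lambda name: RESULTS[name][metric])


if __name__ == "__main__":
    print(margins("SuTI (Imagen)", "DreamBooth (Imagen)"))
    for metric in ("DINO", "CLIP-I", "CLIP-T"):
        print(metric, "->", best_method(metric))
```

With these numbers, SuTI leads DreamBooth (Imagen) by 0.045 on DINO and 0.007 on CLIP-I and trails it by 0.002 on CLIP-T.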

This review was generated by AI and checked by a human editor.