QUICK REVIEW

[Paper Review] Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models

Xuhui Jia, Yang Zhao|arXiv (Cornell University)|Apr 5, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 21

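One-Line Summary

This paper presents a framework that generates personalized images of a target object from a single image, with no test-time optimization, by combining an object-focused encoder with a regularized joint training strategy on top of a pretrained diffusion model.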

ABSTRACT

This paper proposes a method for generating images of customized objects specified by users. The method is based on a general framework that bypasses the lengthy optimization required by previous approaches, which often employ a per-object optimization paradigm. Our framework adopts an encoder to capture high-level identifiable semantics of objects, producing an object-specific embedding with only a single feed-forward pass. The acquired object embedding is then passed to a text-to-image synthesis model for subsequent generation. To effectively blend an object-aware embedding space into a well-developed text-to-image model under the same generation context, we investigate different network designs and training strategies, and propose a simple yet effective regularized joint training scheme with an object identity preservation loss. Additionally, we propose a caption generation scheme that becomes a critical piece in ensuring that the object-specific embedding is faithfully reflected in the generation process, while keeping control and editing abilities. Once trained, the network is able to produce diverse content and styles, conditioned on both texts and objects. We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity, without the need for test-time optimization. Systematic studies are also conducted to analyze our models, providing insights for future work.

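Motivation and Objectives

Motivate scalable personalized image synthesis that does not require fine-tuning for each individual object.
Develop an object-embedding framework that conditions a pretrained text-to-image model.
Integrate object embeddings while preserving editing capability and identity fidelity.
Propose caption-based data augmentation to improve object-specific generation.
Demonstrate single-pass inference across diverse styles and objects while reducing storage and compute costs.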

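Proposed Method

Insert cross-attention modules into a frozen pretrained diffusion model so that it can be conditioned on object embeddings.
Use a frozen CLIP image encoder (the object encoder) and a frozen T5-XXL text encoder to produce the embeddings.
Apply a regularized joint training scheme with cross-reference regularization to preserve editability and object fidelity.
Implement object-embedding masking to separate object identity from the background.
Generate descriptive captions with PaLI and attribute classifiers to create domain-specific, automatically captioned training data.
Train the entire network end to end, rather than only the added attention layers, so that the object embedding is used effectively.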
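
The conditioning path described in the list above (object tokens, masking, then cross-attention into the image latents) can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration: the function names, the single-head attention with identity projections, the residual update, and the idea that masking drops background patch tokens via a boolean foreground mask. A real model would use learned query/key/value projections and many heads, and the paper's actual masking mechanism may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mask_object_tokens(tokens, keep):
    """Keep only tokens flagged True (assumed to be foreground patches)."""
    kept = [tok for tok, flag in zip(tokens, keep) if flag]
    if not kept:
        raise ValueError("mask removed every object token")
    return kept


def cross_attention(queries, context):
    """Single-head scaled dot-product attention with identity projections.

    queries: image-latent features, one vector per spatial position.
    context: object tokens, used as both keys and values.
    """
    dim = len(context[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, key)) / math.sqrt(dim)
            for key in context
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * val[i] for w, val in zip(weights, context)) for i in range(dim)]
        )
    return outputs


def condition_on_object(latents, object_tokens, keep):
    """Residual update: each latent plus its attention readout of the masked tokens."""
    context = mask_object_tokens(object_tokens, keep)
    attended = cross_attention(latents, context)
    return [[x + a for x, a in zip(lat, att)] for lat, att in zip(latents, attended)]
```

Because the masked-out tokens never enter the attention context, they cannot influence the latents, which is the intuition behind separating object identity from background in this sketch.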

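Experimental Results

Research Questions

RQ1: Is a single object embedding sufficient for personalized generation without test-time optimization?
RQ2: How should object embeddings be integrated with a pretrained diffusion model without losing language-guided editing ability?
RQ3: Which training strategy preserves object identity while maintaining controllability through text?
RQ4: Does automatic captioning improve the quality and diversity of object-specific synthesis?
RQ5: Which choice of encoder best captures high-level object concepts for robust personalization?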

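Key Findings

The proposed method delivers high-quality personalized images in a single forward pass and outperforms Textual Inversion, DreamBooth, and InstructPix2Pix in identity preservation and prompt alignment.
CLIP-based object embeddings improve identity preservation and appearance variation compared with a VGG-based encoder.
Cross-reference regularization improves identity fidelity and diversity by separating object identity from image-specific cues.
Fine-tuning the entire network preserves identity better than training only the added attention layers.
Automatic captioning bridges the domain gap between general and domain-specific datasets, improving text-image alignment and object fidelity.
Because the approach avoids per-object optimization and keeps storage cost constant regardless of the number of objects, it remains efficient and scalable.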
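
One plausible reading of the cross-reference regularization finding above is that the object embedding is computed from one image of an object while the training target is a different image of the same object, so image-specific cues such as pose or background cannot simply be copied through. The sketch below shows only that pairing step. The pairing rule, the function name `sample_cross_reference_pair`, and the file names are assumptions for illustration and are not confirmed by this review.

```python
import random


def sample_cross_reference_pair(images_by_object, rng):
    """Return (object_id, reference_image, target_image).

    The reference image would feed the object encoder and the target image
    would be the denoising target. The two differ whenever the object has
    at least two images (assumed pairing rule).
    """
    object_id = rng.choice(sorted(images_by_object))
    images = images_by_object[object_id]
    if len(images) < 2:
        # Only one view available: fall back to plain self-reconstruction.
        return object_id, images[0], images[0]
    reference, target = rng.sample(images, 2)
    return object_id, reference, target


if __name__ == "__main__":
    # Hypothetical dataset layout: object id -> list of image paths.
    dataset = {
        "mug": ["mug_a.png", "mug_b.png", "mug_c.png"],
        "dog": ["dog_a.png"],
    }
    print(sample_cross_reference_pair(dataset, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps sampling reproducible, which helps when comparing training runs with and without the cross-reference pairing.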


This review was generated by AI and checked by a human editor.