QUICK REVIEW

[Paper Review] Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models

Shihao Zhao, Dongdong Chen|arXiv (Cornell University)|May 25, 2023

Citations: 65

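One-line summary

Uni-ControlNet introduces a unified two-adapter framework that adds multiple local and global controls to pre-trained text-to-image diffusion models, enabling composable control while keeping fine-tuning costs low. It separates local control from global control and shows strong controllability and generation quality across diverse conditions.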

ABSTRACT

Text-to-Image diffusion models have made tremendous progress over the past two years, enabling the generation of highly realistic images based on open-domain text descriptions. However, despite their success, text descriptions often struggle to adequately convey detailed controls, even when composed of long and complex texts. Moreover, recent studies have also shown that these models face challenges in understanding such complex texts and generating the corresponding images. Therefore, there is a growing need to enable more control modes beyond text description. In this paper, we introduce Uni-ControlNet, a unified framework that allows for the simultaneous utilization of different local controls (e.g., edge maps, depth maps, segmentation masks) and global controls (e.g., CLIP image embeddings) in a flexible and composable manner within one single model. Unlike existing methods, Uni-ControlNet only requires the fine-tuning of two additional adapters upon frozen pre-trained text-to-image diffusion models, eliminating the huge cost of training from scratch. Moreover, thanks to some dedicated adapter designs, Uni-ControlNet only necessitates a constant number (i.e., 2) of adapters, regardless of the number of local or global controls used. This not only reduces the fine-tuning costs and model size, making it more suitable for real-world deployment, but also facilitates composability of different conditions. Through both quantitative and qualitative comparisons, Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability. Code is available at https://github.com/ShihaoZhaoZSH/Uni-ControlNet.

Research motivation and objectives

Motivate adding diverse, fine-grained controls beyond text prompts for T2I diffusion models.
Design a unified, lightweight adapter-based framework that supports multiple local and global controls in one model.
Reduce fine-tuning cost and model size by using only two adapters regardless of the number of controls.
Enable composable control by allowing independent training of local and global adapters that can be combined at inference.
Show improved controllability and image fidelity over existing methods through quantitative and qualitative experiments.

Proposed method

Classify controls into local (e.g., edge maps, depth, segmentation) and global (e.g., CLIP image embeddings).
Introduce a shared local condition encoder with multi-scale condition injection via a Feature Denormalization (FDN) module to modulate noise features at multiple resolutions.
Implement a shared global condition encoder to convert global signals into tokens that extend the text prompt and interact through cross-attention in all layers.
Fine-tune only two adapters (one for local, one for global) on frozen pre-trained diffusion models, enabling composable conditioning.
Train the adapters separately with dropout strategies to promote robustness and eventual composability during inference without joint fine-tuning.
During inference, merge the two adapters and sample with DDIM using classifier-free guidance; adjust the global weight lambda depending on whether a text prompt is present.
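
The multi-scale injection described above can be illustrated with a small sketch. It assumes a SPADE-style form of feature denormalization, norm(x) * (1 + gamma) + beta, applied at a single resolution. The function names, the flat-list representation, and the hand-picked gamma and beta values are illustrative assumptions rather than the authors' code, where gamma and beta would come from learned layers over the encoded local condition.

```python
import math


def normalize(features, eps=1e-5):
    """Zero-mean, unit-variance normalization over a flat list of activations."""
    mean = sum(features) / len(features)
    var = sum((f - mean) ** 2 for f in features) / len(features)
    return [(f - mean) / math.sqrt(var + eps) for f in features]


def feature_denorm(features, gamma, beta):
    """Assumed SPADE-style modulation: norm(x) * (1 + gamma) + beta, element-wise.

    gamma and beta are hypothetical stand-ins for the outputs of small learned
    layers applied to the encoded local condition at this resolution.
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma and beta must have the same length")
    normed = normalize(features)
    return [n * (1.0 + g) + b for n, g, b in zip(normed, gamma, beta)]


# Toy noise features at one resolution, flattened to a list.
noise_features = [0.5, -1.0, 2.0, 0.0]

# With zero modulation the module reduces to plain normalization.
identity_out = feature_denorm(noise_features, [0.0] * 4, [0.0] * 4)

# A non-zero condition scales and shifts individual positions.
modulated_out = feature_denorm(
    noise_features, [0.5, 0.5, 0.0, 0.0], [1.0, 0.0, 0.0, -1.0]
)
```

In a multi-scale setup the same function would be applied once per resolution, each time with gamma and beta computed from the condition resized to that resolution.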

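Experimental results

Research questions

RQ1: Can a two-adapter architecture support multiple local and global controls in a single pre-trained T2I diffusion model?
RQ2: Does separating local and global adapters improve controllability and composability compared to per-condition adapters?
RQ3: What are effective strategies for injecting local and global condition information to preserve generation fidelity across diverse controls?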

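Key findings

Uni-ControlNet achieves improved controllability and fidelity using only two adapters, regardless of the number of conditions.
The proposed local control adapter modulates noise features through multi-scale injection with FDN, improving alignment with local conditions.
The global control adapter extends the prompt with global tokens derived from a CLIP-based encoder, achieving effective global conditioning through cross-attention.
Separately trained local and global adapters can be combined at inference without additional joint fine-tuning, enabling flexible mixing of conditions.
Quantitative results on COCO2017 show favorable FID scores across multiple control conditions compared with ControlNet, GLIGEN, and T2I-Adapter, with competitive controllability metrics.
Qualitative results show coherent integration of multiple conditions (local + global) and robust performance in both single-condition and multi-condition scenarios.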
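
As a rough illustration of how global tokens might extend the prompt, the sketch below reshapes a flat global embedding into a few tokens, scales them by a weight (the lambda mentioned for inference), and appends them to the text tokens so that cross-attention sees both. The helper names, the plain reshape standing in for a learned projection, and the toy dimensions are assumptions, not the paper's implementation.

```python
def split_into_tokens(embedding, num_tokens):
    """Reshape a flat embedding into num_tokens equal-length tokens.

    Hypothetical stand-in for a learned projection of a CLIP image embedding.
    """
    if num_tokens <= 0 or len(embedding) % num_tokens != 0:
        raise ValueError("embedding length must be divisible by num_tokens")
    dim = len(embedding) // num_tokens
    return [embedding[i * dim:(i + 1) * dim] for i in range(num_tokens)]


def extend_prompt(text_tokens, global_tokens, global_weight=1.0):
    """Append weighted global tokens to the text tokens.

    The returned sequence is what the cross-attention layers would attend to.
    """
    all_tokens = list(text_tokens) + list(global_tokens)
    if all_tokens:
        dim = len(all_tokens[0])
        if any(len(tok) != dim for tok in all_tokens):
            raise ValueError("all tokens must share the same dimension")
    weighted = [[global_weight * v for v in tok] for tok in global_tokens]
    return list(text_tokens) + weighted


# Two toy text tokens of dimension 2.
text_tokens = [[0.1, 0.2], [0.3, 0.4]]

# A toy global embedding of length 4, split into two global tokens.
global_tokens = split_into_tokens([1.0, 2.0, 3.0, 4.0], num_tokens=2)

# Down-weight the global condition relative to the text prompt.
extended = extend_prompt(text_tokens, global_tokens, global_weight=0.5)
# extended == [[0.1, 0.2], [0.3, 0.4], [0.5, 1.0], [1.5, 2.0]]
```

Because the text tokens pass through unchanged, setting the weight to zero leaves only zero-valued extra tokens, which is one simple way to think about turning the global condition down.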


This review was generated by AI and checked by a human editor.