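[Paper Review] Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models
Uni-ControlNet introduces a unified two-adapter framework that enables multiple local and global controls on pre-trained text-to-image diffusion models, achieving composable control at reduced fine-tuning cost. It separates local control from global control and demonstrates strong controllability and generation quality across diverse conditions.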
Text-to-Image diffusion models have made tremendous progress over the past two years, enabling the generation of highly realistic images based on open-domain text descriptions. However, despite their success, text descriptions often struggle to adequately convey detailed controls, even when composed of long and complex texts. Moreover, recent studies have also shown that these models face challenges in understanding such complex texts and generating the corresponding images. Therefore, there is a growing need to enable more control modes beyond text description. In this paper, we introduce Uni-ControlNet, a unified framework that allows for the simultaneous utilization of different local controls (e.g., edge maps, depth maps, segmentation masks) and global controls (e.g., CLIP image embeddings) in a flexible and composable manner within one single model. Unlike existing methods, Uni-ControlNet only requires the fine-tuning of two additional adapters upon frozen pre-trained text-to-image diffusion models, eliminating the huge cost of training from scratch. Moreover, thanks to some dedicated adapter designs, Uni-ControlNet only necessitates a constant number (i.e., 2) of adapters, regardless of the number of local or global controls used. This not only reduces the fine-tuning costs and model size, making it more suitable for real-world deployment, but also facilitates composability of different conditions. Through both quantitative and qualitative comparisons, Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability. Code is available at https://github.com/ShihaoZhaoZSH/Uni-ControlNet.
Research Motivation and Objectives
- Motivate adding diverse, fine-grained controls beyond text prompts for T2I diffusion models.
- Design a unified, lightweight adapter-based framework that supports multiple local and global controls in one model.
- Reduce fine-tuning cost and model size by using only two adapters regardless of the number of controls.
- Enable composable control by allowing independent training of local and global adapters that can be combined at inference.
- Show improved controllability and image fidelity over existing methods through quantitative and qualitative experiments.
Proposed Method
- Classify controls into local (e.g., edge maps, depth, segmentation) and global (e.g., CLIP image embeddings).
- Introduce a shared local condition encoder with multi-scale condition injection via a Feature Denormalization (FDN) module to modulate noise features at multiple resolutions.
- Implement a shared global condition encoder to convert global signals into tokens that extend the text prompt and interact through cross-attention in all layers.
- Fine-tune only two adapters (one for local, one for global) on frozen pre-trained diffusion models, enabling composable conditioning.
- Train the adapters separately with dropout strategies to promote robustness and eventual composability during inference without joint fine-tuning.
- During inference, merge the two adapters and use DDIM sampling with classifier-free guidance; adjust the global weight lambda depending on whether a text prompt is present.
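The Feature Denormalization step named in the list above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes a SPADE-style form in which the noise features are normalized per channel and then scaled and shifted by maps predicted from the local-condition features. The function name `feature_denorm`, the use of 1x1 convolutions (written here as channel-mixing matrices `w_gamma` and `w_beta`), and the tensor shapes are all assumptions made for illustration. Only a single resolution is shown; the multi-scale injection described above would repeat this at each resolution with condition features resized to match.

```python
import numpy as np

def feature_denorm(noise_feat, cond_feat, w_gamma, w_beta, eps=1e-5):
    """Hypothetical FDN-style modulation (assumed form, not the paper's code).

    noise_feat: (C, H, W) noise features from the diffusion backbone
    cond_feat:  (Cc, H, W) local-condition features at the same resolution
    w_gamma, w_beta: (C, Cc) weights acting as 1x1 convolutions
    """
    # Normalize each channel of the noise features over its spatial positions.
    mean = noise_feat.mean(axis=(1, 2), keepdims=True)
    var = noise_feat.var(axis=(1, 2), keepdims=True)
    normed = (noise_feat - mean) / np.sqrt(var + eps)

    # Predict spatially varying scale and shift maps from the condition.
    gamma = np.einsum("oc,chw->ohw", w_gamma, cond_feat)
    beta = np.einsum("oc,chw->ohw", w_beta, cond_feat)

    # Zero weights leave the normalized features unchanged.
    return normed * (1.0 + gamma) + beta


rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8, 8))   # toy noise features
c = rng.normal(size=(3, 8, 8))   # toy condition features
out = feature_denorm(z, c, rng.normal(size=(4, 3)), rng.normal(size=(4, 3)))
print(out.shape)
```

Writing the scale as `1 + gamma` means that zero-initialized weights reduce the module to plain normalization, which is one common way to let a new adapter start without disturbing a frozen backbone; whether the paper initializes its layers this way is not stated in this review.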
Experimental Results
Research Questions
- RQ1: Can a two-adapter architecture support multiple local and global controls in a single pre-trained T2I diffusion model?
- RQ2: Does separating local and global adapters improve controllability and composability compared to per-condition adapters?
- RQ3: What are effective strategies for injecting local and global condition information to preserve generation fidelity across diverse controls?
Key Findings
- Uni-ControlNet achieves controllability and fidelity improvements while using only two adapters, regardless of the number of conditions.
- The proposed local control adapter uses multi-scale injection with FDNs to modulate noise features, leading to better alignment with local conditions.
- The global control adapter extends the prompt with global tokens derived from a CLIP-based encoder, enabling effective global conditioning through cross-attention.
- Separately trained local and global adapters can be composed at inference time without additional joint fine-tuning, enabling flexible condition mixtures.
- Quantitative results show favorable FID scores across several controls compared with ControlNet, GLIGEN, and T2I-Adapter on COCO2017, with competitive controllability metrics.
- Qualitative results demonstrate coherent integration of multiple conditions (local and global together) and robust performance in both single-condition and multi-condition scenarios.
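The prompt extension used by the global adapter can be illustrated with a small sketch. It assumes that the global embedding (for example a CLIP image embedding) is projected into a fixed number of tokens, scaled by the global weight lambda, and appended to the text tokens so that the existing cross-attention layers attend to both. The function name `extend_prompt`, the single linear projection `w_proj`, the token count, and the way lambda is applied are illustrative assumptions rather than details confirmed by this review.

```python
import numpy as np

def extend_prompt(text_tokens, global_emb, w_proj, num_global_tokens, lam=1.0):
    """Hypothetical prompt extension (assumed form, not the paper's code).

    text_tokens: (K0, D) text-encoder tokens
    global_emb:  (E,) global condition embedding, e.g. a CLIP image embedding
    w_proj:      (E, num_global_tokens * D) projection weights
    lam:         global weight applied to the appended tokens
    """
    d = text_tokens.shape[1]
    # Project the single global embedding into several token-sized vectors.
    global_tokens = (global_emb @ w_proj).reshape(num_global_tokens, d)
    # Append the weighted global tokens after the text tokens.
    return np.concatenate([text_tokens, lam * global_tokens], axis=0)


rng = np.random.default_rng(0)
text = rng.normal(size=(77, 768))            # toy text tokens
image_emb = rng.normal(size=(768,))          # toy global embedding
proj = rng.normal(size=(768, 4 * 768)) * 0.02
extended = extend_prompt(text, image_emb, proj, num_global_tokens=4, lam=0.75)
print(extended.shape)
```

Because the text tokens are left untouched and the global tokens are only appended, setting `lam` to zero in this sketch removes the global signal without changing the text conditioning, which is one way to read the review's note about adjusting lambda depending on whether a text prompt is present.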
This review was generated by AI and reviewed by a human editor.