QUICK REVIEW

[Paper Review] UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild

Can Qin, Shu Zhang | arXiv (Cornell University) | 2023-05-18

Multimodal Machine Learning Applications | Citations: 24

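One-line summary

UniControl unifies multiple controllable condition-to-image (C2I) tasks in a single diffusion model, which enables zero-shot adaptation to unseen visual conditions. It often outperforms single-task baselines of comparable model size while staying efficient.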

ABSTRACT

Achieving machine autonomy and human control often represent divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when prompted with arbitrary languages. However, they often fall short in generating images with spatial, structural, or geometric controls. The integration of such controls, which can accommodate various visual conditions in a single unified model, remains an unaddressed challenge. In response, we introduce UniControl, a new generative foundation model that consolidates a wide array of controllable condition-to-image (C2I) tasks within a singular framework, while still allowing for arbitrary language prompts. UniControl enables pixel-level-precise image generation, where visual conditions primarily influence the generated structures and language prompts guide the style and context. To equip UniControl with the capacity to handle diverse visual conditions, we augment pretrained text-to-image diffusion models and introduce a task-aware HyperNet to modulate the diffusion models, enabling the adaptation to different C2I tasks simultaneously. Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities with unseen visual conditions. Experimental results show that UniControl often surpasses the performance of single-task-controlled methods of comparable model sizes. This control versatility positions UniControl as a significant advancement in the realm of controllable visual generation.

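Research motivation and goals

Motivate the need for a unified framework for controllable image generation that handles both language prompts and diverse visual conditions.
Develop a mechanism that shares knowledge across tasks to improve efficiency and quality.
Enable zero-shot generalization to unseen tasks and condition modalities.
Scale to many controllable tasks while keeping the model size down.
Provide a dataset and benchmark for multi-condition visual generation.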

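Proposed method

Introduce a Mixture-of-Experts (MoE) style adapter that captures low-level features from the different visual conditions.
Develop a task-aware HyperNet that modulates ControlNet through task-conditioned embeddings derived from language prompts.
Reformulate training so that K tasks are combined with their task instructions, enabling unified learning across conditions.
Train on MultiGen-20M, a dataset of 20 million image-text-condition triplets spanning nine tasks.
Apply classifier-free guidance to strengthen control by the input visual condition.
Demonstrate zero-shot generalization to unseen tasks and hybrid condition combinations.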
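
The adapter and HyperNet items in the list above can be illustrated with a toy NumPy sketch. This is not the authors' implementation: it assumes that the MoE-style adapter picks one expert by task identity, that the HyperNet maps a task embedding to per-channel scales applied to a shared weight, and it uses plain linear maps in place of convolutional layers. The task names, dimensions, and function names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

TASKS = ["canny", "depth", "pose"]  # hypothetical task names, for illustration only
FEAT_DIM = 8    # toy feature width
EMBED_DIM = 4   # toy task-embedding width

# One small expert per task; a linear map stands in for the conv layers (assumption).
experts = {task: rng.normal(size=(FEAT_DIM, FEAT_DIM)) for task in TASKS}

def moe_adapter(condition_feat, task):
    """Send the condition features through the expert that matches the task."""
    return condition_feat @ experts[task]

# Weight shared by all tasks, plus a tiny hypernetwork projection (both random here).
shared_weight = rng.normal(size=(FEAT_DIM, FEAT_DIM))
hyper_proj = rng.normal(size=(EMBED_DIM, FEAT_DIM))

def task_aware_weight(task_embedding):
    """Map a task embedding to per-output-channel scales and modulate the shared weight."""
    scales = task_embedding @ hyper_proj   # shape: (FEAT_DIM,)
    return shared_weight * scales          # broadcasting scales each output column

def control_features(condition_feat, task, task_embedding):
    """Task-specific adapter followed by the task-modulated shared layer."""
    hidden = moe_adapter(condition_feat, task)
    return hidden @ task_aware_weight(task_embedding)

# Usage: a batch of two condition feature vectors and a stand-in task embedding.
# In a real system the embedding would come from a text encoder (assumption).
x = rng.normal(size=(2, FEAT_DIM))
depth_embedding = rng.normal(size=EMBED_DIM)
print(control_features(x, "depth", depth_embedding).shape)  # (2, 8)
```

The point of the split is that only the small experts are task-specific, while the shared weight is reused across tasks and merely rescaled by the task embedding, which is one way a single model can serve many conditions without growing linearly in size.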

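Experimental results

Research questions

RQ1: Can a single diffusion model jointly learn, and generalize across, language prompts and multiple visual condition-to-image tasks?
RQ2: Are the MoE-style adapter and the task-aware HyperNet effective for multi-task learning and zero-shot transfer across related and unseen conditions?
RQ3: How does the unified model compare with task-specific baselines in quality and efficiency across diverse C2I tasks?
RQ4: How accurately can the model generate under hybrid or unseen visual conditions without task-specific retraining?
RQ5: Which datasets and benchmarks best support training and evaluating multi-task controllable diffusion models?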

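Key results

UniControl outperforms task-specific controllers on several tasks while remaining compact (~1.5B parameters).
The MoE-style adapter and the task-aware HyperNet substantially improve performance; in the ablation study, the full model achieves the best FID score.
Zero-shot generalization lets the model handle unseen tasks and hybrid condition combinations without explicit training on them.
Qualitative results show improved alignment with both the visual condition and the language prompt on tasks such as edge, segmentation, depth, surface normal, pose, and outpainting.
In the user study, UniControl was generally preferred over the reimplemented single-task controllers across several tasks.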
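
For readers unfamiliar with the FID score mentioned in the ablation result, the sketch below shows the Fréchet distance between two Gaussians that the metric is built on (lower is better). It is a minimal illustration, not the evaluation code used in the paper: real FID computes the means and covariances from Inception-v3 features of real and generated images, whereas here the features are random arrays and the function names are hypothetical.

```python
import numpy as np

def feature_stats(features):
    """Mean vector and covariance matrix of a (num_samples, dim) feature array."""
    mu = features.mean(axis=0)
    sigma = np.cov(features, rowvar=False)
    return mu, sigma

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between the Gaussians N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # tr((sigma1 sigma2)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of sigma1 @ sigma2, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    # Drop tiny imaginary parts and clip negative values caused by round-off.
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)

# Usage with stand-in features (random data, not image features).
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 16))
fake_feats = rng.normal(loc=0.5, size=(500, 16))
score = frechet_distance(*feature_stats(real_feats), *feature_stats(fake_feats))
print(f"Frechet distance on toy features: {score:.3f}")
```

Two identical distributions give a distance of zero, and the distance grows as the means or covariances drift apart, which is why an ablation that removes a useful component is expected to raise the score.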

This review was generated by AI and reviewed by a human editor.