QUICK REVIEW

[Paper Review] UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild

Can Qin, Shu Zhang|arXiv (Cornell University)|May 18, 2023

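Multimodal Machine Learning Applications | Cited by 24

One-Sentence Summary

UniControl unifies multiple visual condition-to-image generation tasks in a single diffusion model, generalizes zero-shot to unseen visual conditions, and often surpasses single-task baselines while remaining efficient.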

ABSTRACT

Achieving machine autonomy and human control often represent divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when prompted with arbitrary languages. However, they often fall short in generating images with spatial, structural, or geometric controls. The integration of such controls, which can accommodate various visual conditions in a single unified model, remains an unaddressed challenge. In response, we introduce UniControl, a new generative foundation model that consolidates a wide array of controllable condition-to-image (C2I) tasks within a singular framework, while still allowing for arbitrary language prompts. UniControl enables pixel-level-precise image generation, where visual conditions primarily influence the generated structures and language prompts guide the style and context. To equip UniControl with the capacity to handle diverse visual conditions, we augment pretrained text-to-image diffusion models and introduce a task-aware HyperNet to modulate the diffusion models, enabling the adaptation to different C2I tasks simultaneously. Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities with unseen visual conditions. Experimental results show that UniControl often surpasses the performance of single-task-controlled methods of comparable model sizes. This control versatility positions UniControl as a significant advancement in the realm of controllable visual generation.

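Research Motivation and Goals

Propose a unified framework for controllable image generation that handles language prompts and diverse visual conditions at the same time.
Develop a mechanism for sharing knowledge across tasks to improve efficiency and quality.
Achieve zero-shot generalization to unseen tasks and condition modalities.
Keep the model compact while scaling to multi-task control.
Provide a dataset and benchmark for multi-condition visual generation.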

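Proposed Method

Introduce a Mixture-of-Experts (MOE) style adapter that captures low-level features from diverse visual conditions.
Develop a task-aware HyperNet that modulates the ControlNet through task-conditioned embeddings derived from language prompts.
Reformulate training to pair each of the K tasks with a task instruction, enabling unified learning across conditions.
Train on MultiGen-20M, a dataset of 20M image-text-condition triplets spanning nine tasks.
Apply classifier-free guidance to strengthen the controllability of the input visual conditions.
Demonstrate zero-shot generalization to unseen tasks and to mixed combinations of conditions.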
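
The method items above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the task names, the affine "experts", the hard task-key routing, the per-channel scaling used as a stand-in for HyperNet modulation, and all function names (make_adapter, route_condition, modulate, classifier_free_guidance) are assumptions made for illustration. Only the classifier-free guidance formula is the standard one.

```python
from typing import Callable, Dict, List

Vector = List[float]
Adapter = Callable[[Vector], Vector]


def make_adapter(scale: float, shift: float) -> Adapter:
    """Toy expert: an elementwise affine map standing in for a small conv stack."""
    def adapter(condition: Vector) -> Vector:
        return [scale * x + shift for x in condition]
    return adapter


# One expert per visual-condition type. Task names and weights are illustrative.
ADAPTERS: Dict[str, Adapter] = {
    "edge": make_adapter(1.0, 0.0),
    "depth": make_adapter(0.5, 0.25),
    "pose": make_adapter(2.0, -0.5),
}


def route_condition(task: str, condition: Vector) -> Vector:
    """Hard routing (assumed): the task key selects exactly one expert."""
    if task not in ADAPTERS:
        raise KeyError(f"no adapter registered for task {task!r}")
    return ADAPTERS[task](condition)


def modulate(features: Vector, task_weights: Vector) -> Vector:
    """Task-aware modulation, simplified here to per-channel scaling."""
    if len(features) != len(task_weights):
        raise ValueError("features and task_weights must have equal length")
    return [f * w for f, w in zip(features, task_weights)]


def classifier_free_guidance(eps_uncond: Vector, eps_cond: Vector, scale: float) -> Vector:
    """Standard CFG: extrapolate from the unconditional toward the conditional prediction."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    features = route_condition("depth", [1.0, 2.0])    # expert chosen by task key
    controlled = modulate(features, [2.0, 4.0])        # task-derived weights (made up)
    guided = classifier_free_guidance([0.0, 1.0], [1.0, 3.0], scale=2.0)
    print(features, controlled, guided)
```

With a guidance scale of 1.0 the function returns the conditional prediction unchanged; larger scales push the result further in the direction the condition suggests, which is why guidance is described as strengthening controllability.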

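Experimental Results

Research Questions

RQ1: Can a single diffusion model learn multiple visual condition-to-image tasks alongside language prompts, and generalize across them?
RQ2: Do the MOE-style adapter and the task-aware HyperNet enable effective multi-task learning and zero-shot transfer to related and unseen conditions?
RQ3: Across diverse C2I (condition-to-image) tasks, how does the unified model compare with task-specific baselines in quality and efficiency?
RQ4: Under mixed or unseen visual conditions, how accurately can the model generate images without task-specific retraining?
RQ5: Which datasets and benchmarks best support training and evaluating multi-task controllable diffusion models?

Key Findings

UniControl outperforms task-specific control models on several tasks while keeping a compact model size (about 1.5B parameters).
The MOE-style adapter and the task-aware HyperNet both improve performance; ablation studies show that the full model achieves the best FID score.
Zero-shot generalization lets the model handle unseen tasks and mixed combinations of conditions without explicit training on them.
Qualitative results show better alignment with both the visual condition and the language prompt on tasks including edge, segmentation, depth, normal, pose, and inpainting.
A user study shows that UniControl is generally preferred over re-implemented single-task control models across multiple tasks.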
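
For readers unfamiliar with the FID metric cited in the ablation finding above, the standard definition of the Fréchet Inception Distance is given below. It is the general definition of the metric, not a detail taken from this paper; lower values indicate that generated images are statistically closer to real ones.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Here \mu_r and \Sigma_r are the mean and covariance of Inception-network features computed on real images, and \mu_g and \Sigma_g are the same statistics computed on generated images.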

This review was generated by AI and checked by a human editor.