QUICK REVIEW

[Paper Review] UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild

Can Qin, Shu Zhang | arXiv (Cornell University) | May 18, 2023

Multimodal Machine Learning Applications | Cited by 24

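One-sentence summary

UniControl unifies multiple visually conditioned generation tasks in a single diffusion model, enables zero-shot adaptation to unseen visual conditions, and often outperforms single-task baselines of comparable size while remaining efficient.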

ABSTRACT

Achieving machine autonomy and human control often represent divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when prompted with arbitrary languages. However, they often fall short in generating images with spatial, structural, or geometric controls. The integration of such controls, which can accommodate various visual conditions in a single unified model, remains an unaddressed challenge. In response, we introduce UniControl, a new generative foundation model that consolidates a wide array of controllable condition-to-image (C2I) tasks within a singular framework, while still allowing for arbitrary language prompts. UniControl enables pixel-level-precise image generation, where visual conditions primarily influence the generated structures and language prompts guide the style and context. To equip UniControl with the capacity to handle diverse visual conditions, we augment pretrained text-to-image diffusion models and introduce a task-aware HyperNet to modulate the diffusion models, enabling the adaptation to different C2I tasks simultaneously. Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities with unseen visual conditions. Experimental results show that UniControl often surpasses the performance of single-task-controlled methods of comparable model sizes. This control versatility positions UniControl as a significant advancement in the realm of controllable visual generation.

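Motivation and objectives

Propose a unified framework for controllable image generation that handles both language prompts and diverse visual conditions.
Develop a mechanism that shares knowledge across tasks to improve efficiency and quality.
Enable zero-shot generalization to unseen tasks and condition modalities.
Scale to multiple controllable tasks while reducing model size.
Provide a dataset and benchmark for multi-condition visual generation.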

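Proposed method

Introduce an MoE-style adapter that captures low-level features from different visual conditions.
Develop a task-aware HyperNet that modulates ControlNet through task-conditioned embeddings derived from language prompts.
Restructure training to combine K tasks with task instructions, enabling unified learning across conditions.
Train on MultiGen-20M, a dataset of 20M image-text-condition triplets spanning 9 tasks.
Apply classifier-free guidance to strengthen the controllability of the input visual condition.
Demonstrate zero-shot generalization to unseen tasks and hybrid condition combinations.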
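
To make three of the method points above concrete, here is a minimal pure-Python sketch, not the authors' implementation. It assumes that "MoE-style" means hard routing, where the task key selects exactly one adapter, and that the HyperNet's effect can be pictured as a per-channel scaling driven by a task embedding. The guidance function uses the standard classifier-free guidance combination. All names (`ADAPTERS`, `route_condition`, `modulate`, `classifier_free_guidance`) and all numeric values are hypothetical and chosen only for illustration; real adapters would be convolution modules operating on image tensors.

```python
from typing import Callable, Dict, List

Vector = List[float]


def make_adapter(scale: float, shift: float) -> Callable[[Vector], Vector]:
    """Toy stand-in for one task-specific adapter (a conv module in practice)."""
    def adapter(condition: Vector) -> Vector:
        return [scale * x + shift for x in condition]
    return adapter


# Hypothetical per-task adapters; keys and parameters are illustrative only.
ADAPTERS: Dict[str, Callable[[Vector], Vector]] = {
    "depth": make_adapter(0.5, 0.0),
    "edge": make_adapter(2.0, -1.0),
}


def route_condition(task: str, condition: Vector) -> Vector:
    """Assumed MoE-style hard routing: the task key picks exactly one adapter."""
    if task not in ADAPTERS:
        raise KeyError(f"no adapter registered for task {task!r}")
    return ADAPTERS[task](condition)


def modulate(features: Vector, task_embedding: Vector) -> Vector:
    """Toy task-aware modulation: scale each channel by (1 + embedding value)."""
    if len(features) != len(task_embedding):
        raise ValueError("features and task_embedding must have equal length")
    return [f * (1.0 + t) for f, t in zip(features, task_embedding)]


def classifier_free_guidance(
    eps_uncond: Vector, eps_cond: Vector, guidance_scale: float
) -> Vector:
    """Standard CFG: move from the unconditional toward the conditional estimate."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    feats = route_condition("depth", [2.0, 4.0])
    feats = modulate(feats, [0.0, 0.5])
    print(feats)
    print(classifier_free_guidance([0.0, 1.0], [1.0, 3.0], 2.0))
```

With a guidance scale of 1.0 the guided estimate equals the conditional one; larger values push the output further toward the condition, which is the lever the review refers to when it mentions strengthening controllability.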

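Experimental results

Research questions

RQ1: Can a single diffusion model learn and generalize across multiple visual condition-to-image tasks alongside language prompts?
RQ2: Do the MoE-style adapter and task-aware HyperNet enable effective multi-task learning and zero-shot transfer across related and unseen conditions?
RQ3: How does the unified model compare with task-specific baselines in quality and efficiency across diverse C2I tasks?
RQ4: How accurately can the model generate under hybrid or unseen visual conditions without task-specific retraining?
RQ5: Which datasets and benchmarks best support training and evaluating multi-task controllable diffusion models?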

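Key findings

UniControl outperforms task-specific control models on several tasks while remaining compact (about 1.5B parameters).
The MoE-style adapter and task-aware HyperNet substantially improve performance; in the ablation, the full model achieves the best FID score.
Zero-shot generalization lets the model handle unseen tasks and hybrid condition combinations without explicit training.
Qualitative results show improved alignment with both visual conditions and language prompts on tasks such as edge, segmentation, depth, normal, pose, and outpainting.
The user study shows that UniControl generally outperforms reimplemented single-task control models across multiple tasks.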

This review was generated by AI and checked by a human editor.