QUICK REVIEW

[Paper Review] MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation

Omer Bar-Tal, Lior Yariv|arXiv (Cornell University)|Feb 16, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 65

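One-line summary

MultiDiffusion fuses multiple diffusion paths from a pre-trained model into a single generation process, enabling controllable image generation, including panoramas and region-based prompts, without any training or fine-tuning.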

ABSTRACT

Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image, and fast adaptation to new tasks still remains an open challenge, currently mostly addressed by costly and long re-training and fine-tuning or ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning. At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high quality and diverse images that adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes. Project webpage: https://multidiffusion.github.io

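Motivation and objectives

Enable controllable text-to-image generation without costly retraining or fine-tuning.
Propose a unified framework that binds multiple diffusion paths together through shared constraints.
Demonstrate applicability to aspect-ratio extension (panoramas) and region-based prompting.
Show that high-quality, coherent outputs can be achieved while keeping the reference model frozen.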

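Proposed method

Defines a MultiDiffusion process Psi that operates in the target image space J and is built on the pre-trained diffusion model Phi, reusing its parameters without further training.
Introduces mappings F_i: J→I and lambda_i: Z→Y that tie each target region and its condition to the reference model.
Formulates a least-squares Follow-the-Diffusion-Paths (FTD) objective that reconciles the denoising steps of multiple regions and conditions: L_FTD(J|J_t,z) = sum_i || W_i ⊗ [F_i(J) − Phi(I_t^i|y_i)] ||^2, where I_t^i = F_i(J_t) and y_i = lambda_i(z).
When each F_i is a simple pixel crop, obtains a closed-form least-squares solution for Psi, which makes each update step efficient.
For region-based generation, applies bootstrapping and region masks to improve adherence to tight region constraints.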
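
To make the closed-form step concrete: when every F_i is a pixel crop, minimizing a per-pixel weighted sum of squared differences reduces to a per-pixel weighted average of the denoised crops, each placed back at its position on the canvas. The following is a minimal NumPy sketch of that idea, not the authors' implementation. The names `fuse_crops` and `multidiffusion_step`, the `denoise_fn` callable (a stand-in for one denoising step of Phi on a single crop), and the default uniform weights are all assumptions made for illustration.

```python
import numpy as np


def fuse_crops(denoised_crops, offsets, canvas_shape, weights=None):
    """Per-pixel weighted average of crops placed back on the canvas.

    denoised_crops: list of (h, w, c) arrays, one per crop.
    offsets: list of (top, left) positions of each crop on the canvas.
    canvas_shape: (H, W, c) shape of the target image J.
    weights: optional list of (h, w) arrays; uniform weights if None
             (an assumption for this sketch).
    """
    height, width, _ = canvas_shape
    numerator = np.zeros(canvas_shape, dtype=np.float64)
    denominator = np.zeros((height, width, 1), dtype=np.float64)

    for i, (crop, (top, left)) in enumerate(zip(denoised_crops, offsets)):
        h, w = crop.shape[:2]
        if weights is None:
            weight = np.ones((h, w, 1))
        else:
            weight = np.asarray(weights[i], dtype=np.float64)[..., None]
        # Accumulate the weighted crop and its weight at the crop location.
        numerator[top:top + h, left:left + w] += weight * crop
        denominator[top:top + h, left:left + w] += weight

    if np.any(denominator == 0):
        raise ValueError("every canvas pixel must be covered by some crop")
    return numerator / denominator


def multidiffusion_step(canvas, offsets, crop_size, denoise_fn):
    """One fused step: denoise each crop independently, then average."""
    h, w = crop_size
    crops = [denoise_fn(canvas[t:t + h, l:l + w]) for t, l in offsets]
    return fuse_crops(crops, offsets, canvas.shape)
```

In a panorama setting, `offsets` would slide a fixed-size window across a wide canvas with overlap, so pixels in overlapping areas receive the average of several denoising predictions. That shared averaging is what keeps neighboring crops consistent with each other.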

Experimental results

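Research questions

RQ1: Can a pre-trained diffusion model be steered toward new generation tasks without any training or fine-tuning?
RQ2: How can multiple diffusion paths, corresponding to different regions or aspect ratios, be merged into a single coherent generation step?
RQ3: What mappings between the target image space and the reference model's space are effective for controllable generation?
RQ4: Is the approach competitive with, or better than, task-specific baselines in quality and coherence for panoramas and region-based prompts?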

Key findings

Method	FID	CLIP-score	CLIP-aesthetic
Stable Diffusion	6.05±3.1	0.27	6.36
SI	45.5±14.5	0.26	5.76
BLD	18.4±7.4	0.27	6.02
Ours	10.3±4.8	0.27	6.36

MultiDiffusion produces high-quality, coherent panoramas by fusing diffusion paths across crops rather than processing each crop independently.
Region-based generation from masks and rough prompts achieves higher IoU on COCO than the SI and BLD baselines.
In the panorama experiments (table above), Ours has a lower FID than SI and BLD (10.3 vs. 45.5 and 18.4) and a higher CLIP-aesthetic score (6.36 vs. 5.76 and 6.02), while CLIP-score is on par (0.27 vs. 0.26 and 0.27).
The method performs competitively with task-specific baselines without any training or fine-tuning of the reference model.
Bootstrapping improves fidelity to tight masks, yielding higher IoU scores in the COCO evaluation.
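
For readers unfamiliar with the IoU metric used in the region-based findings, here is a minimal sketch of intersection-over-union between two binary masks. It assumes the predicted mask comes from some segmentation of the generated image and the target mask is the user-provided region; this review does not describe the exact evaluation pipeline. The function name `mask_iou` and the choice to score two empty masks as 1.0 are assumptions made for illustration.

```python
import numpy as np


def mask_iou(pred_mask, target_mask):
    """Intersection-over-union of two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    target = np.asarray(target_mask, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("masks must have the same shape")

    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection / union)
```

A higher value means the generated content stays closer to the requested region, which is the sense in which bootstrapping is reported to improve fidelity to tight masks.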

This review was generated by AI and checked by a human editor.