QUICK REVIEW

[Paper Review] MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation

Omer Bar-Tal, Lior Yariv|arXiv (Cornell University)|2023. 02. 16.

Generative Adversarial Networks and Image Synthesis|Citations: 65

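One-Line Summary

MultiDiffusion fuses multiple diffusion paths of a pre-trained model into a single, unified generation process that requires no training, enabling controllable image generation such as panoramas and region-based prompting without fine-tuning.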

ABSTRACT

Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image, and fast adaptation to new tasks still remains an open challenge, currently mostly addressed by costly and long re-training and fine-tuning or ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning. At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high quality and diverse images that adhere to user-provided controls, such as desired aspect ratio (e.g., panorama), and spatial guiding signals, ranging from tight segmentation masks to bounding boxes. Project webpage: https://multidiffusion.github.io

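Motivation and Objectives

Enable controllable text-to-image generation without costly re-training or fine-tuning.
Propose a unified framework that binds multiple diffusion paths together through shared constraints.
Demonstrate applicability to aspect-ratio extension (panoramas) and region-based prompting.
Show that high-quality, coherent outputs can be achieved while keeping the reference model frozen.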

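Proposed Method

Define a MultiDiffusion process Psi that operates in the target image space J and shares parameters with the pre-trained diffusion model Phi.
Formulate the least-squares Follow-the-Diffusion-Paths (FTD) objective, which reconciles the denoising steps of multiple regions/conditions: L_FTD(J | J_t, z) = sum_i || W_i ⊗ [F_i(J) − Phi(I_t^i | y_i)] ||^2. At each step, Psi(J_t | z) is the J that minimizes this objective.
When each F_i is a simple pixel crop, the least-squares problem has a closed-form solution for Psi, which makes every per-step update efficient.
Introduce mappings F_i: J→I and lambda_i: Z→Y that connect target regions and conditions to the reference model, so that I_t^i = F_i(J_t) and y_i = lambda_i(z).
Apply bootstrapping and region masks during region-based generation to improve fidelity to tight region constraints.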
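
The closed-form case can be illustrated with a small sketch. When every F_i is a crop, minimizing a weighted least-squares objective of this form pixel by pixel gives a weighted average of the per-crop denoised outputs, with weights W_i. The NumPy code below shows only that fusion step. The names `sliding_windows`, `fuse_step`, and `toy_denoise` are hypothetical, the window size and stride are arbitrary, and `toy_denoise` is a stand-in for the reference model Phi rather than a real diffusion model.

```python
import numpy as np

def sliding_windows(height, width, size, stride):
    """Return (top, left, h, w) crops that cover the whole canvas (hypothetical helper)."""
    if size > height or size > width:
        raise ValueError("window size must fit inside the canvas")
    tops = list(range(0, height - size + 1, stride))
    lefts = list(range(0, width - size + 1, stride))
    # Add a final window flush with the border if the stride skipped it.
    if tops[-1] != height - size:
        tops.append(height - size)
    if lefts[-1] != width - size:
        lefts.append(width - size)
    return [(t, l, size, size) for t in tops for l in lefts]

def fuse_step(canvas, windows, denoise, weights=None):
    """One fused update: per-pixel weighted average of per-window outputs.

    canvas  : (H, W, C) float array, the current image J_t
    windows : list of (top, left, h, w) crops, playing the role of F_i
    denoise : callable(crop, i) -> array shaped like crop (stand-in for Phi)
    weights : optional list of (h, w) weight maps W_i; uniform if omitted
    """
    num = np.zeros_like(canvas, dtype=np.float64)
    den = np.zeros(canvas.shape[:2] + (1,), dtype=np.float64)
    for i, (top, left, h, w) in enumerate(windows):
        crop = canvas[top:top + h, left:left + w]
        out = denoise(crop, i)
        w_i = np.ones((h, w)) if weights is None else np.asarray(weights[i])
        w_i = w_i[..., None]  # broadcast the weight map over channels
        num[top:top + h, left:left + w] += w_i * out
        den[top:top + h, left:left + w] += w_i
    # Pixels with zero total weight keep their current value.
    return np.where(den > 0, num / np.maximum(den, 1e-12), canvas)

def toy_denoise(crop, i):
    """Assumption: a dummy 'model' that returns the window index everywhere."""
    return np.full_like(crop, float(i))

rng = np.random.default_rng(0)
canvas = rng.standard_normal((8, 20, 3))
windows = sliding_windows(8, 20, size=8, stride=4)
fused = fuse_step(canvas, windows, toy_denoise)
```

With the toy denoiser, pixels covered by two overlapping windows end up halfway between the two window indices, which is the averaging behaviour that lets neighbouring crops agree on their shared pixels. Passing region masks as `weights` gives the masked variant of the same average.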

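Experimental Results

Research Questions

RQ1: Can a pre-trained diffusion model be steered toward new generation tasks without training or fine-tuning?
RQ2: How can multiple diffusion paths, corresponding to different regions or aspect ratios, be reconciled into a single consistent generation step?
RQ3: Which mappings between the target image space and the reference model's space are effective for enabling controllable generation?
RQ4: Does the method deliver quality and consistency competitive with task-specific baselines for panoramas and region-based prompting?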

Key Results

Method	FID	CLIP-score	CLIP-aesthetic
Stable Diffusion	6.05±3.1	0.27	6.36
SI	45.5±14.5	0.26	5.76
BLD	18.4±7.4	0.27	6.02
Ours	10.3±4.8	0.27	6.36

MultiDiffusion produces high-quality, coherent panoramas by fusing the diffusion paths across crops rather than treating the crops independently.
Region-based generation from masks and rough prompts achieves higher IoU on COCO than the SI and BLD baselines.
In the panorama experiments, MultiDiffusion (Ours) obtains a lower FID than the SI and BLD baselines (10.3 vs. 45.5 and 18.4) and a higher CLIP-aesthetic score (6.36 vs. 5.76 and 6.02), while its CLIP-score (0.27) is on par with BLD (0.27) and slightly above SI (0.26).
The method reaches quality competitive with task-specific baselines without any training or fine-tuning of the reference model.
Bootstrapping improves fidelity to tight masks, yielding higher IoU scores on the COCO evaluation.
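
For readers unfamiliar with the IoU metric cited above, a minimal sketch of intersection-over-union between two binary masks follows. It assumes the predicted mask has already been obtained from the generated image by some segmentation step, which this review does not describe. The function name `mask_iou` and the empty-mask convention are assumptions made for illustration.

```python
import numpy as np

def mask_iou(pred, target):
    """Intersection-over-union of two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumption: two empty masks are treated as a perfect match.
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)

# Toy example: a 4x4 target square and a prediction shifted by one pixel.
target = np.zeros((8, 8), dtype=bool)
target[2:6, 2:6] = True
pred = np.zeros((8, 8), dtype=bool)
pred[3:7, 3:7] = True
iou = mask_iou(pred, target)  # 9 overlapping pixels out of 23 in the union
```

A higher IoU means the generated object occupies the region the user's mask asked for more closely, which is the sense in which bootstrapping is said to improve fidelity to tight masks.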


This review was generated by AI and reviewed by a human editor.