QUICK REVIEW

[Paper Review] Subject-driven Text-to-Image Generation via Apprenticeship Learning

Wenhu Chen, Hexiang Hu|arXiv (Cornell University)|2023. 04. 01.

Multimodal Machine Learning Applications|Citations: 46

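One-Line Summary

SuTI trains a single apprentice diffusion model to imitate a massive number of subject-specific expert models, which enables in-context subject-driven image generation with no test-time fine-tuning. It achieves strong subject fidelity at much higher speed and outperforms DreamBooth on several metrics.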

ABSTRACT

Recent text-to-image generation models like DreamBooth have made remarkable progress in generating highly customized images of a target subject, by fine-tuning an "expert model" for a given subject from a few examples. However, this process is expensive, since a new expert model must be learned for each subject. In this paper, we present SuTI, a Subject-driven Text-to-Image generator that replaces subject-specific fine tuning with in-context learning. Given a few demonstrations of a new subject, SuTI can instantly generate novel renditions of the subject in different scenes, without any subject-specific optimization. SuTI is powered by apprenticeship learning, where a single apprentice model is learned from data generated by a massive number of subject-specific expert models. Specifically, we mine millions of image clusters from the Internet, each centered around a specific visual subject. We adopt these clusters to train a massive number of expert models, each specializing in a different subject. The apprentice model SuTI then learns to imitate the behavior of these fine-tuned experts. SuTI can generate high-quality and customized subject-specific images 20x faster than optimization-based SoTA methods. On the challenging DreamBench and DreamBench-v2, our human evaluation shows that SuTI significantly outperforms existing models like InstructPix2Pix, Textual Inversion, Imagic, Prompt2Prompt, Re-Imagen and DreamBooth, especially on the subject and text alignment aspects.

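Motivation and Objectives

Pursue efficient, scalable subject-driven image generation without fine-tuning a separate model for each subject.
Use apprenticeship learning so that a single apprentice model imitates many expert models.
Enable in-context generation for unseen subjects and compositions from only a few demonstrations.
Compare against baselines on DreamBench and DreamBench-v2 using automatic and human evaluation.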

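Proposed Method

Train a large number of subject-specific expert diffusion models on mined image-text clusters.
Synthesize pseudo-targets from the expert outputs and train a single apprentice diffusion model on them.
Apply Delta CLIP filtering so that only high-quality expert outputs are used for apprentice training.
At inference, generate new images from 3-5 in-context demonstrations with no optimization.
Scale training through distributed, TPU-based parallel fine-tuning of the expert and apprentice models.
Compare against baselines using DINO, CLIP-I, CLIP-T, and human evaluation.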
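
The filtering step above can be illustrated with a minimal sketch. This summary does not define Delta CLIP, so the code assumes one plausible reading: the score is the image-text similarity of the expert output minus the best image-text similarity among the demonstration images, and a sample is kept only when that gain clears a threshold. The function names (`delta_clip`, `filter_expert_outputs`), the dictionary keys, and the toy 2-D vectors are all hypothetical; real embeddings would come from a CLIP encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def delta_clip(generated_emb, demo_embs, text_emb):
    """ASSUMED definition: similarity gain of the expert output over the
    best-matching demonstration image, both measured against the target text."""
    generated_score = cosine(generated_emb, text_emb)
    best_demo_score = max(cosine(d, text_emb) for d in demo_embs)
    return generated_score - best_demo_score


def filter_expert_outputs(samples, threshold):
    """Keep only samples whose assumed Delta CLIP score reaches the threshold."""
    return [
        s for s in samples
        if delta_clip(s["generated"], s["demos"], s["text"]) >= threshold
    ]


# Toy embeddings (hypothetical): the first expert output follows the prompt,
# the second one ignores it.
samples = [
    {"generated": [1.0, 0.0], "demos": [[0.0, 1.0], [1.0, 1.0]], "text": [1.0, 0.0]},
    {"generated": [0.0, 1.0], "demos": [[0.0, 1.0], [1.0, 1.0]], "text": [1.0, 0.0]},
]
kept = filter_expert_outputs(samples, threshold=0.1)
print(len(kept))
```

Raising `threshold` keeps fewer but cleaner pseudo-targets, which is the trade-off between training-set size and quality that the filtering step controls.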

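Experimental Results

Research Questions

RQ1: Can a single apprentice diffusion model generalize to unseen subjects and compositions without test-time fine-tuning?
RQ2: How does the number of in-context demonstrations affect subject fidelity and text alignment?
RQ3: How does data-quality filtering (Delta CLIP) affect final generation performance?
RQ4: How does SuTI compare with DreamBooth and other subject-driven methods on DreamBench and DreamBench-v2?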

Key Results

Method	Backbone	DINO ↑	CLIP-I ↑	CLIP-T ↑
Real Image (Oracle)	-	0.774	0.885	-
DreamBooth	Imagen (1)	0.696	0.812	0.306
DreamBooth	SD (21)	0.668	0.803	0.305
Textual Inversion	SD (21)	0.569	0.780	0.255
Re-Imagen	Imagen (1)	0.600	0.740	0.270
Ours: SuTI	Imagen (1)	0.741	0.819	0.304
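
For readers unfamiliar with the column headers, the sketch below shows how such scores are commonly computed. It assumes the usual DreamBench convention: DINO and CLIP-I are the mean pairwise cosine similarity between embeddings of generated and real subject images (from a DINO or CLIP image encoder respectively), and CLIP-T is the mean cosine similarity between generated-image embeddings and the prompt embedding. This summary does not spell out those definitions, and the helper names and toy vectors here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(generated_embs, real_embs):
    """Subject fidelity (DINO or CLIP-I style): average over every
    (generated, real) image pair."""
    sims = [cosine(g, r) for g in generated_embs for r in real_embs]
    return sum(sims) / len(sims)


def text_alignment(generated_embs, prompt_emb):
    """Text alignment (CLIP-T style): average similarity to the prompt."""
    sims = [cosine(g, prompt_emb) for g in generated_embs]
    return sum(sims) / len(sims)


# Toy 2-D embeddings standing in for real encoder outputs.
generated = [[1.0, 0.0], [0.0, 1.0]]
real = [[1.0, 0.0]]
prompt = [1.0, 0.0]

fidelity = mean_pairwise_similarity(generated, real)
alignment = text_alignment(generated, prompt)
print(fidelity, alignment)
```

Under this convention the "Real Image (Oracle)" row is an upper reference for fidelity, since it compares real images of the subject with each other, and it has no CLIP-T entry because no prompt is involved.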

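For unseen subjects, SuTI generates in-context immediately from just 3-5 demonstrations, with no subject-specific optimization.
On DreamBench, SuTI scores 0.741 on DINO, 0.819 on CLIP-I, and 0.304 on CLIP-T, exceeding DreamBooth (Imagen) on DINO (0.696) and CLIP-I (0.812) while roughly matching it on CLIP-T (0.306).
In human evaluation on DreamBench-v2, SuTI's overall score is 5% higher than DreamBooth's and at least 30% higher than those of the other baselines.
Delta CLIP filtering quality has a decisive effect on performance: a higher threshold improves human scores even though it shrinks the training set.
Dream-SuTI (SuTI further fine-tuned on the subject images, as in DreamBooth) improves quality further, achieving a higher overall score than both SuTI and DreamBooth.
At inference, SuTI takes about 20 seconds per subject, works with 3-5 demonstrations, and has a smaller memory footprint than many fine-tuning approaches.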
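
The DreamBench comparison with DreamBooth can be checked with simple arithmetic on the scores reported in the table. The snippet below is only a convenience sketch: the dictionary layout and the `margins` helper are hypothetical, and the numbers are copied from the two Imagen-backbone rows.

```python
# Scores copied from the DreamBench table (Imagen backbone rows).
scores = {
    "DreamBooth (Imagen)": {"DINO": 0.696, "CLIP-I": 0.812, "CLIP-T": 0.306},
    "SuTI (Imagen)": {"DINO": 0.741, "CLIP-I": 0.819, "CLIP-T": 0.304},
}


def margins(ours, baseline):
    """Per-metric difference (ours minus baseline), rounded to 3 decimals."""
    return {metric: round(ours[metric] - baseline[metric], 3) for metric in ours}


diff = margins(scores["SuTI (Imagen)"], scores["DreamBooth (Imagen)"])
print(diff)  # {'DINO': 0.045, 'CLIP-I': 0.007, 'CLIP-T': -0.002}
```

The gain is concentrated in DINO (subject fidelity), while the CLIP-T gap of 0.002 is small enough to read as parity on text alignment.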

This review was generated by AI and reviewed by a human editor.