QUICK REVIEW

[Paper Review] Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack

Xiaoliang Dai, Ji Hou | arXiv (Cornell University) | 2023. 09. 27.

Generative Adversarial Networks and Image Synthesis | Citations: 30

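One-Line Summary

Emu shows that quality-tuning a pre-trained text-to-image model on a small set of highly aesthetic images greatly improves visual appeal while preserving generality, and that the resulting model outperforms SDXLv1.0 on visual aesthetics.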

ABSTRACT

Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text. However, these pre-trained models often face challenges when it comes to generating highly aesthetic images. This creates the need for aesthetic alignment post pre-training. In this paper, we propose quality-tuning to effectively guide a pre-trained model to exclusively generate highly visually appealing images, while maintaining generality across visual concepts. Our key insight is that supervised fine-tuning with a set of surprisingly small but extremely visually appealing images can significantly improve the generation quality. We pre-train a latent diffusion model on $1.1$ billion image-text pairs and fine-tune it with only a few thousand carefully selected high-quality images. The resulting model, Emu, achieves a win rate of $82.9\%$ compared with its pre-trained only counterpart. Compared to the state-of-the-art SDXLv1.0, Emu is preferred $68.4\%$ and $71.3\%$ of the time on visual appeal on the standard PartiPrompts and our Open User Input benchmark based on the real-world usage of text-to-image models. In addition, we show that quality-tuning is a generic approach that is also effective for other architectures, including pixel diffusion and masked generative transformer models.

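Motivation and Goals

Improve the aesthetic alignment of text-to-image generation beyond what pre-training alone achieves.
Demonstrate that a small, manually curated, high-quality dataset can substantially improve image aesthetics.
Show that quality-tuning preserves generality across visual concepts and domains.
Provide evidence that the benefits of quality-tuning transfer to other architectures.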

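Proposed Method

Pre-train a latent diffusion model (LDM) on 1.1 billion image-text pairs.
Curate a fine-tuning set of 2,000 high-quality images through automatic filtering and manual selection guided by principles of photography.
Fine-tune the model with a batch size of 64 and a noise offset of 0.1 for at most 15k iterations.
Apply quality-tuning to alternative architectures (pixel diffusion and a masked generative transformer) to verify that the approach is generic.
Evaluate by human preference on PartiPrompts and Open User Input (OUI) prompts, focusing on visual appeal and text faithfulness.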
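
The review gives the noise offset only as a number (0.1) and does not spell out how it is applied. As a minimal sketch, assuming the commonly used formulation in which a single Gaussian shift per channel, scaled by the offset, is added to the usual per-element Gaussian noise, the sampling step might look like the following. The function name `sample_offset_noise`, the `QUALITY_TUNING` dictionary, and the toy tensor shape are hypothetical and chosen for illustration only; pure-Python lists stand in for real tensors.

```python
import random

def sample_offset_noise(channels, height, width, offset=0.1, rng=None):
    """Return noise[c][y][x] = eps + offset * shift_c for one image or latent.

    eps is a fresh standard Gaussian draw per element; shift_c is a single
    standard Gaussian draw shared by every element of channel c.
    This formulation is an assumption, not taken from the reviewed paper.
    """
    rng = rng or random.Random()
    noise = []
    for _ in range(channels):
        shift = rng.gauss(0.0, 1.0)  # one draw per channel
        plane = [
            [rng.gauss(0.0, 1.0) + offset * shift for _ in range(width)]
            for _ in range(height)
        ]
        noise.append(plane)
    return noise

# Fine-tuning hyperparameters quoted in the method list above.
QUALITY_TUNING = {"batch_size": 64, "noise_offset": 0.1, "max_iterations": 15_000}

# Toy shape (4 x 8 x 8) chosen arbitrarily for the demonstration.
noise = sample_offset_noise(
    channels=4, height=8, width=8,
    offset=QUALITY_TUNING["noise_offset"],
    rng=random.Random(0),
)
print(len(noise), len(noise[0]), len(noise[0][0]))
```

With `offset=0` the function reduces to ordinary independent Gaussian noise, so the offset only controls how much of the noise is shared across a whole channel.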

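Experimental Results

Research Questions

RQ1: Can a very small, high-quality fine-tuning dataset steer a pre-trained text-to-image model toward higher visual appeal without losing concept coverage?
RQ2: Does quality-tuning transfer to model architectures beyond latent diffusion?
RQ3: How does the quality-tuned model compare with the pre-trained-only model in visual appeal and alignment with the text prompt?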

Key Results

Prompt set (Emu vs. SDXLv1.0, visual appeal)	win (%)	tie (%)	lose (%)
Parti (All)	68.4	2.1	29.5
OUI (All)	71.3	1.2	27.5
Parti (Stylized)	81.7	1.9	16.3
OUI (Stylized)	75.5	1.4	23.1
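
The rows above report wins, ties, and losses separately. One way to read them is to look at the share of wins among comparisons that were not ties. The sketch below computes that figure from the table values; this "decisive win rate" is a derived quantity introduced here for illustration and is not reported in the review, and the names `RESULTS`, `decisive_win_rate`, and `row_total` are hypothetical.

```python
# Pairwise preference rows copied from the table above: (win %, tie %, lose %).
RESULTS = {
    "Parti (All)": (68.4, 2.1, 29.5),
    "OUI (All)": (71.3, 1.2, 27.5),
    "Parti (Stylized)": (81.7, 1.9, 16.3),
    "OUI (Stylized)": (75.5, 1.4, 23.1),
}

def decisive_win_rate(win, tie, lose):
    """Percentage of wins among comparisons that did not end in a tie."""
    return 100.0 * win / (win + lose)

def row_total(win, tie, lose):
    """Sum of the three percentages; should be close to 100 up to rounding."""
    return win + tie + lose

for name, row in RESULTS.items():
    print(
        f"{name}: total={row_total(*row):.1f}%, "
        f"decisive win rate={decisive_win_rate(*row):.1f}%"
    )
```

The row totals also serve as a quick sanity check on the transcribed numbers: a total that differs from 100 by more than a rounding error would suggest a copying mistake.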

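Against its pre-trained-only counterpart, Emu achieves a visual-appeal win rate of 82.9% on PartiPrompts and 91.2% on Open User Input prompts.
On visual appeal, Emu is preferred over SDXLv1.0 68.4% of the time on Parti (All) and 71.3% of the time on OUI (All).
Quality-tuning also improves text faithfulness (36.7% on PartiPrompts and 47.9% on OUI, respectively).
Similar gains hold on stylized prompts: Emu outperforms SDXLv1.0 on visual appeal on the stylized subsets as well as on the full sets (81.7% on Parti Stylized, 75.5% on OUI Stylized).
Quality-tuning is also effective for other architectures (pixel diffusion and masked generative transformer), improving both visual appeal and text faithfulness.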


This review was generated by AI and reviewed by a human editor.