QUICK REVIEW

[Paper Review] eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Yogesh Balaji, Seungjun Nah | arXiv (Cornell University) | 2022-11-02

Generative Adversarial Networks and Image Synthesis | Citations: 223

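One-Line Summary

eDiff-I trains an ensemble of expert denoisers, each specialized for a different stage of diffusion-based text-to-image generation, which improves text alignment without increasing inference cost; it also conditions on multiple encoders and supports a paint-with-words capability.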

ABSTRACT

Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiff-I's "paint-with-words" capability. A user can select the word in the input text and paint it in a canvas to control the output, which is very handy for crafting the desired image in mind. The project page is available at https://deepimagination.cc/eDiff-I/

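Motivation and Objectives

Motivates the need to treat the different synthesis stages of diffusion-based text-to-image generation separately, since the model's reliance on the text prompt changes over the course of sampling.
Proposes an ensemble of expert denoisers specialized for different noise levels, improving text alignment while keeping inference cost unchanged.
Investigates how multiple conditioning encoders (T5 text, CLIP text, CLIP image) affect generation behavior.
Scales the ensemble with a training-efficient fine-tuning strategy that starts from a single shared model rather than training every expert from scratch.
Introduces a training-free paint-with-words mechanism for spatial control over generation.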

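Proposed Method

Trains a base diffusion model and progressively branches it into expert denoisers, each responsible for an interval of noise levels.
Uses a binary-tree branching scheme: the noise-level distribution is split recursively, each expert is initialized from its parent and fine-tuned on its interval, and the experts concentrate on the high-noise and low-noise extremes plus one expert for the intermediate range.
Combines several input embeddings (T5 text, CLIP text, CLIP image) through cross-attention, applying dropout to the embeddings during training, to obtain flexible conditioning.
Introduces a training-free paint-with-words mechanism that modulates cross-attention with user-provided masks to control spatial layout.
Deploys a cascade of diffusion models (64x64 base, SR256, SR1024) and applies degradations during training so that the super-resolution stages generalize better.
Evaluates the zero-shot FID-CLIP trade-off on COCO and Visual Genome and compares against state-of-the-art baselines.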
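The routing idea behind the expert ensemble can be illustrated with a small sketch. This is not the authors' implementation: the class `ExpertEnsemble`, the helper `make_toy_expert`, the cut points in `boundaries`, and the noise `schedule` are all hypothetical names and values chosen for illustration, and the "experts" are toy functions rather than trained networks. The point it shows is that each sampling step calls exactly one expert, selected by the current noise level, so the number of network evaluations matches that of a single shared model.

```python
from bisect import bisect_right
from typing import Callable, List, Sequence

# A denoiser maps (noisy sample, noise level) to a denoised estimate.
Denoiser = Callable[[List[float], float], List[float]]


class ExpertEnsemble:
    """Dispatches each denoising step to the expert that owns its noise interval."""

    def __init__(self, boundaries: Sequence[float], experts: Sequence[Denoiser]):
        if len(experts) != len(boundaries) + 1:
            raise ValueError("need exactly one expert per noise interval")
        if list(boundaries) != sorted(boundaries):
            raise ValueError("boundaries must be in ascending order")
        self.boundaries = list(boundaries)
        self.experts = list(experts)
        self.calls = 0  # network evaluations performed so far

    def expert_index(self, sigma: float) -> int:
        # Interval k covers boundaries[k-1] <= sigma < boundaries[k].
        return bisect_right(self.boundaries, sigma)

    def denoise(self, x: List[float], sigma: float) -> List[float]:
        self.calls += 1  # one expert runs per step, as with a single shared model
        return self.experts[self.expert_index(sigma)](x, sigma)


def make_toy_expert(name: str, log: List[str]) -> Denoiser:
    """Stand-in for a trained network: records its name and shrinks the sample."""
    def expert(x: List[float], sigma: float) -> List[float]:
        log.append(name)
        return [v / (1.0 + sigma) for v in x]
    return expert


log: List[str] = []
ensemble = ExpertEnsemble(
    boundaries=[0.5, 5.0],  # hypothetical cut points, not values from the paper
    experts=[make_toy_expert(n, log) for n in ("low", "mid", "high")],
)

x = [1.0, -2.0]
schedule = [80.0, 10.0, 5.0, 1.0, 0.5, 0.1]  # hypothetical high-to-low noise levels
for sigma in schedule:
    x = ensemble.denoise(x, sigma)

print(log)             # which expert handled each step
print(ensemble.calls)  # total evaluations, equal to the number of steps
```

Adding experts in this scheme increases the total parameter count but not the per-step cost, which is the trade-off the review describes.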

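Experimental Results

Research Questions

RQ1: Does an ensemble of expert denoisers improve text-image alignment without increasing inference cost?
RQ2: How do multiple conditioning encoders (T5 text, CLIP text, CLIP image) affect image quality and style-transfer capability?
RQ3: Does the training-free paint-with-words mechanism provide practical spatial control over the generated output?
RQ4: Does eDiff-I outperform single-model baselines on standard text-to-image benchmarks?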

Key Results

Model	Parameters	Zero-shot FID (lower is better)
GLIDE	5B	12.24
Make-A-Scene	4B	11.84
DALL·E 2	6.5B	10.39
Stable Diffusion	1.4B	8.59
Imagen	7.9B	7.27
Parti	20B	7.23
eDiff-I-Config-A	6.8B	7.35
eDiff-I-Config-B	7.1B	7.26
eDiff-I-Config-C	8.1B	7.11
eDiff-I-Config-D	9.1B	6.95

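A two-expert ensemble consistently improves the FID-CLIP trade-off over the baseline on both COCO and Visual Genome.
eDiff-I achieves competitive zero-shot FID while keeping the same inference cost as a single-model diffusion approach.
Combining the T5 and CLIP text encoders gives the best performance, and the CLIP image embedding enables style transfer from a reference image.
Paint-with-words provides spatial control without additional training by modulating cross-attention with user-provided masks.
The efficient branching schedule (refining a shared base model into experts for the high-noise and low-noise extremes and an expert for the intermediate range) scales model capacity while keeping training cost down.
Compared with large baselines, the eDiff-I variants (Config A-D) progressively improve zero-shot FID; in the reported setting, Config D reaches a zero-shot FID of 6.95.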
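The paint-with-words result above rests on biasing cross-attention toward the tokens a user has painted over each image location. The following is a minimal sketch of that idea, not the paper's exact formulation: the function name `paint_with_words_attention` and the constant scalar `weight` are assumptions made for illustration, and any noise-dependent scheduling of the weight is left out.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    # Subtract the row maximum for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def paint_with_words_attention(scores: Matrix, mask: Matrix, weight: float) -> Matrix:
    """Return per-location attention over tokens after biasing scores with a mask.

    scores[i][j]: scaled query-key score between image location i and text token j.
    mask[i][j]:   1.0 if the user painted token j over location i, else 0.0.
    weight:       hypothetical scalar strength of the bias (an assumption here).
    """
    if len(scores) != len(mask) or any(len(s) != len(m) for s, m in zip(scores, mask)):
        raise ValueError("scores and mask must have the same shape")
    return [
        softmax([s + weight * a for s, a in zip(score_row, mask_row)])
        for score_row, mask_row in zip(scores, mask)
    ]


# Two image locations, three text tokens, all raw scores equal.
scores = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
# The user painted token 1 over location 0 only.
mask = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]

attn = paint_with_words_attention(scores, mask, weight=2.0)
unbiased = paint_with_words_attention(scores, mask, weight=0.0)
print(attn[0])  # location 0 now attends mostly to the painted token
print(attn[1])  # location 1 is unchanged (uniform)
```

Because the bias is added to the attention scores at inference time only, no weights are updated, which is consistent with the mechanism being described as training-free.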


This review was generated by AI and reviewed by a human editor.