QUICK REVIEW

[Paper Review] Scalable Diffusion Models with Transformers

William Peebles, Saining Xie | arXiv (Cornell University) | December 19, 2022

Advanced Neuroimaging Techniques and Applications | Citations: 39

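One-Line Summary

This paper replaces the U-Net in latent diffusion models with a transformer backbone and shows that DiT models with higher Gflops achieve lower FID; DiT-XL/2 reaches a state-of-the-art FID of 2.27 on class-conditional ImageNet 256×256 and strong results at 512×512.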

ABSTRACT

We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.

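Motivation and Objectives

Assess whether a transformer backbone can replace the U-Net in diffusion models without loss of performance.
Analyze the scaling behavior of Diffusion Transformers (DiTs), using Gflops as the primary measure of complexity.
Show how model size, patch size, and the conditioning mechanism affect sample quality.
Evaluate DiTs on ImageNet at 256×256 and 512×512 resolution within the LDM framework to establish a state-of-the-art baseline.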

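Proposed Method

Replace the U-Net in latent diffusion models (LDMs) with a transformer.
Use a patchify operation to convert the latent representation into tokens for a ViT-style transformer backbone.
Explore four conditioning strategies (in-context tokens, cross-attention, adaLN, adaLN-Zero) and select adaLN-Zero for its efficiency and quality.
Train DiTs on ImageNet 256×256 and 512×512 within the LDM framework, using a pretrained VAE encoder/decoder so that the diffusion model operates in latent space.
Evaluate models primarily with FID-50K (250 sampling steps), with sFID, IS, Precision, and Recall as secondary metrics; EMA weights are used at inference.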
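The patchify step can be illustrated with a small standard-library sketch. It assumes a 32×32×4 latent (the shape a VAE with 8× downsampling would give for a 256×256 image; this shape is an assumption, not stated in this review) and a hypothetical helper named `patchify`. It only cuts the latent into flat patch tokens; the linear embedding and positional encoding that a real model applies afterwards are omitted.

```python
def patchify(latent, patch_size):
    """Split an H x W x C latent (nested lists) into a list of flat patch tokens."""
    height, width = len(latent), len(latent[0])
    if height % patch_size or width % patch_size:
        raise ValueError("latent size must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    token.extend(latent[row][col])  # append all C channel values
            tokens.append(token)
    return tokens


# Assumed latent shape: 32 x 32 spatial positions with 4 channels each.
latent = [[[float(r * 32 + c)] * 4 for c in range(32)] for r in range(32)]

# Halving the patch size quadruples the number of tokens the transformer sees.
token_counts = {p: len(patchify(latent, p)) for p in (2, 4, 8)}
token_dims = {p: len(patchify(latent, p)[0]) for p in (2, 4, 8)}
print(token_counts)  # {2: 256, 4: 64, 8: 16}
print(token_dims)    # {2: 16, 4: 64, 8: 256}
```

Under these assumptions, the "/2" in DiT-XL/2 corresponds to the smallest patch size and therefore the longest token sequence, which is one of the ways the review describes compute being increased.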

Figure 2 : ImageNet generation with Diffusion Transformers (DiTs). Bubble area indicates the flops of the diffusion model. Left: FID-50K (lower is better) of our DiT models at 400K training iterations. Performance steadily improves in FID as model flops increase. Right: Our best model, DiT-XL/2, is

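Experimental Results

Research Questions

RQ1: Can a transformer backbone match or exceed the image generation quality of U-Net-based diffusion models in latent diffusion?
RQ2: How does increasing the transformer's forward-pass compute (Gflops) affect diffusion sample quality?
RQ3: Which conditioning mechanism in DiT offers the best trade-off between compute and sample quality?
RQ4: What state-of-the-art performance do DiTs reach at 256×256 and 512×512 resolution when trained within the LDM framework?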

Key Results

Model	FID↓	sFID↓	IS↑	Precision↑	Recall↑	Resolution	Gflops (reported)
DiT-XL/2	2.27	4.60	278.24	0.83	0.57	256×256	118.6
DiT-XL/2	3.04	5.02	...	...	...	512×512	524.6
ADM	3.60?	-	-	-	-	256×256	1120
LDM-4	3.95	-	-	-	-	256×256	103.6
LDM-8	7.76	-	-	-	-	256×256	-
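To make the compute comparison concrete, the short sketch below divides each reported Gflops figure from the table by the DiT-XL/2 256×256 figure. The dictionary keys are labels chosen here for illustration, and the 512×512 entry is taken to be DiT-XL/2 based on the findings listed in this review. The ratios say nothing about FID; they only relate forward-pass cost.

```python
# Reported Gflops copied from the results table (LDM-8 has no reported value).
reported_gflops = {
    "DiT-XL/2 (256x256)": 118.6,
    "DiT-XL/2 (512x512)": 524.6,
    "ADM (256x256)": 1120.0,
    "LDM-4 (256x256)": 103.6,
}

baseline = reported_gflops["DiT-XL/2 (256x256)"]

# Ratio of each model's forward-pass cost to the DiT-XL/2 256x256 baseline.
ratios = {name: value / baseline for name, value in reported_gflops.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the Gflops of DiT-XL/2 (256x256)")
```

With these numbers, pixel-space ADM costs roughly 9.4 times the Gflops of DiT-XL/2 at the same resolution, while moving DiT-XL/2 from 256×256 to 512×512 costs roughly 4.4 times as much.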

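Sample quality of Diffusion Transformers (DiTs) improves as Gflops increase: deeper or wider models, or models that process more tokens, achieve lower FID.
DiT-XL/2 achieves a state-of-the-art FID of 2.27 on ImageNet 256×256 with classifier-free guidance, outperforming prior diffusion models.
On ImageNet 512×512, DiT-XL/2 achieves an FID of 3.04, outperforming many prior diffusion methods while using far fewer Gflops than pixel-space diffusion models.
Among the conditioning strategies, adaLN-Zero yields the best FID while adding the fewest Gflops, outperforming the in-context and cross-attention designs.
LDM-based DiTs are compute-efficient: DiT-XL/2 is compute-efficient relative to LDM-4/8 and far more efficient than pixel-space ADM variants.
Scaling DiT models yields consistent FID improvements across training steps and patch sizes, and shows that parameter count is not the only predictor of quality.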
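The adaLN-Zero idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' implementation: it works on a single token vector, the function names `layer_norm` and `adaln_zero_residual` are hypothetical, and `toy_sublayer` stands in for attention or the MLP. In a real model the shift, scale, and gate vectors would be regressed from the timestep and class embeddings; here they are passed in directly.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln_zero_residual(x, sublayer, shift, scale, gate):
    """Compute x + gate * sublayer(layer_norm(x) * (1 + scale) + shift)."""
    normed = layer_norm(x)
    modulated = [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]
    out = sublayer(modulated)
    return [xi + g * oi for xi, g, oi in zip(x, gate, out)]


def toy_sublayer(v):
    """Hypothetical stand-in for attention or the MLP inside a block."""
    return [2.0 * t + 1.0 for t in v]


x = [0.5, -1.0, 2.0, 0.0]
zeros = [0.0] * len(x)
ones = [1.0] * len(x)

# Zero-initialized conditioning outputs: the gated residual branch contributes
# nothing, so the block starts out as the identity function.
at_init = adaln_zero_residual(x, toy_sublayer, zeros, zeros, zeros)

# Once the gate is non-zero, the sublayer output is mixed back into x.
after_training = adaln_zero_residual(x, toy_sublayer, zeros, zeros, ones)

print(at_init == x)         # True
print(after_training == x)  # False
```

The design choice illustrated here is the zero-initialized gate: because every residual branch starts switched off, the network begins training as an identity mapping, and the conditioning adds only a few element-wise operations per block.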

Figure 3 : The Diffusion Transformer (DiT) architecture. Left: We train conditional latent DiT models. The input latent is decomposed into patches and processed by several DiT blocks. Right: Details of our DiT blocks. We experiment with variants of standard transformer blocks that incorporate condit


This review was generated by AI and reviewed by a human editor.