QUICK REVIEW

[Paper Review] Tackling the Generative Learning Trilemma with Denoising Diffusion GANs

Zhisheng Xiao, Karsten Kreis | arXiv (Cornell University) | Dec. 15, 2021

Generative Adversarial Networks and Image Synthesis | Citations: 143

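One-Line Summary

Summary: This paper proposes denoising diffusion GANs, which model each multimodal denoising step with a conditional GAN. This greatly speeds up diffusion sampling while preserving sample quality and diversity, addressing the generative learning trilemma.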

ABSTRACT

A wide variety of deep generative models has been developed in the past decade. Yet, these models often struggle with simultaneously addressing three key requirements including: high sample quality, mode coverage, and fast sampling. We call the challenge imposed by these requirements the generative learning trilemma, as the existing models often trade some of them for others. Particularly, denoising diffusion models have shown impressive sample quality and diversity, but their expensive sampling does not yet allow them to be applied in many real-world applications. In this paper, we argue that slow sampling in these models is fundamentally attributed to the Gaussian assumption in the denoising step which is justified only for small step sizes. To enable denoising with large steps, and hence, to reduce the total number of denoising steps, we propose to model the denoising distribution using a complex multimodal distribution. We introduce denoising diffusion generative adversarial networks (denoising diffusion GANs) that model each denoising step using a multimodal conditional GAN. Through extensive evaluations, we show that denoising diffusion GANs obtain sample quality and diversity competitive with original diffusion models while being 2000$\times$ faster on the CIFAR-10 dataset. Compared to traditional GANs, our model exhibits better mode coverage and sample diversity. To the best of our knowledge, denoising diffusion GAN is the first model that reduces sampling cost in diffusion models to an extent that allows them to be applied to real-world applications inexpensively. Project page and code can be found at https://nvlabs.github.io/denoising-diffusion-gan

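Research Motivation and Goals

Motivate the generative learning trilemma: high sample quality, mode coverage, and fast sampling.
Develop a diffusion-based model that needs only a few denoising steps while largely preserving quality and diversity.
Demonstrate that multimodal denoising distributions enable fast, realistic image generation and editing.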

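Proposed Method

Formulates forward diffusion with a small number of noise-adding steps (roughly T ≤ 8), so each step injects a large amount of noise.
Models the true denoising distribution q(x_{t−1}|x_t) with a multimodal conditional GAN p_θ(x_{t−1}|x_t).
Uses a time-conditioned, GAN-based denoiser and a discriminator D_φ to match q and p_θ at every step with an adversarial loss (min_θ Σ_t E_{q(x_t)}[D_adv(q(x_{t−1}|x_t) || p_θ(x_{t−1}|x_t))]).
Parameterizes p_θ(x_{t−1}|x_t) through an implicit x0 predictor G_θ(x_t, z, t) combined with the Gaussian posterior q(x_{t−1}|x_t, x0), which makes the resulting distribution multimodal.
Introduces a latent variable z to induce multimodality, improving mode coverage and diversity.
Uses an MM-compatible training pipeline that keeps a DDPM-like backbone but equips it with a stochastic, multimodal x0 predictor.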
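
The sampling loop implied by the steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear schedule in `make_schedule`, the stand-in `toy_generator`, and the choice to give z the same shape as x are all hypothetical. Only the Gaussian posterior q(x_{t−1}|x_t, x0) follows the standard DDPM closed form.

```python
import numpy as np

def make_schedule(T, beta_min=0.1, beta_max=0.9):
    """Hypothetical linear schedule with large per-step noise (few steps)."""
    betas = np.linspace(beta_min, beta_max, T)            # beta_1 .. beta_T
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)                       # abar_1 .. abar_T
    alpha_bars_prev = np.concatenate([[1.0], alpha_bars[:-1]])  # abar_0 = 1
    return betas, alphas, alpha_bars, alpha_bars_prev

def posterior_sample(x_t, x0, t, sched, rng):
    """Draw x_{t-1} ~ q(x_{t-1} | x_t, x0); t is 1-indexed."""
    betas, alphas, abar, abar_prev = sched
    i = t - 1
    coef_x0 = np.sqrt(abar_prev[i]) * betas[i] / (1.0 - abar[i])
    coef_xt = np.sqrt(alphas[i]) * (1.0 - abar_prev[i]) / (1.0 - abar[i])
    mean = coef_x0 * x0 + coef_xt * x_t
    var = (1.0 - abar_prev[i]) / (1.0 - abar[i]) * betas[i]
    return mean + np.sqrt(var) * rng.standard_normal(x_t.shape)

def toy_generator(x_t, z, t):
    """Stand-in for G_theta(x_t, z, t); a trained network would go here."""
    return np.tanh(x_t + 0.1 * z)

def sample(shape, T, generator, rng):
    """Run T denoising steps: predict x0 from (x_t, z, t), then sample x_{t-1}."""
    sched = make_schedule(T)
    x = rng.standard_normal(shape)          # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        z = rng.standard_normal(shape)      # fresh latent per step (shape is an assumption)
        x0_pred = generator(x, z, t)        # stochastic x0 prediction
        x = posterior_sample(x, x0_pred, t, sched, rng)
    return x

rng = np.random.default_rng(0)
samples = sample((2, 3), T=4, generator=toy_generator, rng=rng)
```

Because a different z gives a different x0 prediction for the same x_t, the induced p_θ(x_{t−1}|x_t) is a mixture of Gaussians rather than a single Gaussian, which is what allows large steps. At t = 1 the posterior variance is zero, so the final sample equals the generator's last x0 prediction.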

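Experimental Results

Research Questions

RQ1: Can a multimodal denoising distribution at each diffusion step reduce the number of denoising steps without losing sample quality?
RQ2: Does modeling the denoising step with a conditional GAN improve mode coverage and diversity compared with a Gaussian denoiser?
RQ3: How does the proposed approach compare with standard diffusion models and GANs in quality, diversity, and sampling speed?
RQ4: Does the approach scale to higher-resolution data and editing tasks (e.g., stroke-based synthesis) while retaining its speed advantage?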

Key Results

Model	IS ↑	FID ↓	Recall ↑	NFE ↓	Time (s) ↓
Denoising Diffusion GAN (ours), T=4	9.63	3.75	0.57	4	0.21
DDPM (Ho et al., 2020)	9.46	3.21	0.57	1000	80.5
NCSN (Song & Ermon, 2019)	8.87	25.3	-	1000	107.9
Adversarial DSM (Jolicoeur-Martineau et al., 2021b)	-	6.10	-	1000	-
Likelihood SDE (Song et al., 2021b)	-	2.87	-	-	-
Score SDE (VE) (Song et al., 2021c)	9.89	2.20	0.59	2000	423.2
Score SDE (VP) (Song et al., 2021c)	9.68	2.41	0.59	2000	421.5
Probability Flow (VP) (Song et al., 2021c)	9.83	3.08	0.57	140	50.9
LSGM (Vahdat et al., 2021)	9.87	2.10	0.61	147	44.5
DDIM, T=50 (Song et al., 2021a)	8.78	4.67	0.53	50	4.01
FastDDPM, T=50 (Kong & Ping, 2021)	8.98	3.41	0.56	50	4.01
Recovery EBM (Gao et al., 2021)	8.30	9.58	-	180	-
Improved DDPM (Nichol & Dhariwal, 2021)	-	2.90	-	4000	-
VDM (Kingma et al., 2021)	-	4.00	-	1000	-
UDM (Kim et al., 2021)	10.1	2.33	-	2000	-
D3PMs (Austin et al., 2021)	8.56	7.34	-	1000	-
Gotta Go Fast (Jolicoeur-Martineau et al., 2021a)	-	2.44	-	180	-
DDPM Distillation (Luhman & Luhman, 2021)	8.36	9.36	0.51	1	-
SNGAN (Miyato et al., 2018)	8.22	21.7	0.44	1	-
SNGAN+DGflow (Ansari et al., 2021)	9.35	9.62	0.48	25	1.98
AutoGAN (Gong et al., 2019)	8.60	12.4	0.46	1	-
TransGAN (Jiang et al., 2021)	9.02	9.26	-	1	-
StyleGAN2 w/o ADA (Karras et al., 2020a)	9.18	8.32	0.41	1	0.04
StyleGAN2 w/ ADA (Karras et al., 2020a)	9.83	2.92	0.49	1	0.04
StyleGAN2 w/ Diffaug (Zhao et al., 2020)	9.40	5.79	0.42	1	0.04
Glow (Kingma & Dhariwal, 2018)	3.92	48.9	-	1	-
PixelCNN (Oord et al., 2016b)	4.60	65.9	-	1024	-
NVAE (Vahdat & Kautz, 2020)	7.18	23.5	0.51	1	0.36

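Denoising diffusion GANs achieve sample quality and diversity competitive with diffusion models using as few as 2–4 denoising steps.
On CIFAR-10, the method reaches IS 9.63 and FID 3.75 with NFE 4 and a sampling time of 0.21 s, considerably faster than prior diffusion methods.
The model attains higher recall (0.57) than many GAN variants, indicating improved mode coverage.
Sampling on CIFAR-10 is about 2000× faster than predictor-corrector diffusion (Song et al., 2021c) and about 20× faster than FastDDPM.
The latent variable is important for multimodality; removing z degrades sample quality and recall.
Mode coverage experiments (25-Gaussians, StackedMNIST) show full mode coverage and low KL divergence, outperforming several GAN and diffusion baselines.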
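
The speedup figures quoted above can be checked directly against the Time (s) column of the results table. The short script below is only an arithmetic sanity check; the dictionary keys are labels chosen here for readability.

```python
# Sampling times in seconds, copied from the Time (s) column of the table.
times = {
    "Denoising Diffusion GAN (T=4)": 0.21,
    "Score SDE (VE)": 423.2,
    "Score SDE (VP)": 421.5,
    "DDPM": 80.5,
    "FastDDPM (T=50)": 4.01,
}

ours = times["Denoising Diffusion GAN (T=4)"]

# Ratio of each baseline's sampling time to the 4-step model's time.
speedups = {
    name: t / ours
    for name, t in times.items()
    if name != "Denoising Diffusion GAN (T=4)"
}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x the sampling time of the 4-step model")
```

The Score SDE rows give ratios slightly above 2000 and the FastDDPM row gives a ratio just under 20, consistent with the rounded figures in the summary.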


This review was generated by AI and checked by a human editor.