QUICK REVIEW

[Paper Review] Tackling the Generative Learning Trilemma with Denoising Diffusion GANs

Zhisheng Xiao, Karsten Kreis|arXiv (Cornell University)|Dec 15, 2021

Generative Adversarial Networks and Image Synthesis | Citations: 143

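One-Sentence Summary

This paper proposes denoising diffusion GANs, which model each multimodal denoising step with a conditional GAN, greatly accelerating diffusion sampling while preserving sample quality and diversity and thereby addressing the generative learning trilemma.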

ABSTRACT

A wide variety of deep generative models has been developed in the past decade. Yet, these models often struggle with simultaneously addressing three key requirements including: high sample quality, mode coverage, and fast sampling. We call the challenge imposed by these requirements the generative learning trilemma, as the existing models often trade some of them for others. Particularly, denoising diffusion models have shown impressive sample quality and diversity, but their expensive sampling does not yet allow them to be applied in many real-world applications. In this paper, we argue that slow sampling in these models is fundamentally attributed to the Gaussian assumption in the denoising step which is justified only for small step sizes. To enable denoising with large steps, and hence, to reduce the total number of denoising steps, we propose to model the denoising distribution using a complex multimodal distribution. We introduce denoising diffusion generative adversarial networks (denoising diffusion GANs) that model each denoising step using a multimodal conditional GAN. Through extensive evaluations, we show that denoising diffusion GANs obtain sample quality and diversity competitive with original diffusion models while being 2000$\times$ faster on the CIFAR-10 dataset. Compared to traditional GANs, our model exhibits better mode coverage and sample diversity. To the best of our knowledge, denoising diffusion GAN is the first model that reduces sampling cost in diffusion models to an extent that allows them to be applied to real-world applications inexpensively. Project page and code can be found at https://nvlabs.github.io/denoising-diffusion-gan

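Motivation and Objectives

Motivate the generative learning trilemma: high sample quality, mode coverage, and fast sampling.
Develop a diffusion-based model that maintains quality and diversity with a greatly reduced number of denoising steps.
Demonstrate that multimodal denoising distributions enable fast, realistic image generation and editing.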

Proposed Method

Design the forward diffusion, which adds noise to the data, to use a small number of large steps (T ≤ 8).
Model the denoising distribution q(x_{t−1}|x_t) with a multimodal conditional GAN p_θ(x_{t−1}|x_t).
Use a time-conditioned GAN-based denoiser and, at every step, align q and p_θ through an adversarial loss computed with a discriminator D_φ: min_θ sum_t E_{q(x_t)}[D_adv(q(x_{t−1}|x_t) || p_θ(x_{t−1}|x_t))].
Parameterize p_θ(x_{t−1}|x_t) through an implicit x0 predictor G_θ(x_t, z, t) combined with the Gaussian posterior q(x_{t−1}|x_t, x0), which makes the resulting distribution multimodal.
Introduce a latent variable z to induce multimodality, improving mode coverage and diversity.
Use a DDPM-like backbone equipped with a stochastic, multimodal x0 predictor in an MM-compatible training pipeline.
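
The sampling loop implied by the parameterization above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the linear noise schedule in `make_schedule`, the stand-in `toy_generator`, and all function names are assumptions introduced here. Only the structure (predict x0 from x_t, z, and t, then draw x_{t−1} from the standard DDPM Gaussian posterior q(x_{t−1}|x_t, x0)) follows the description in this review.

```python
import numpy as np

def make_schedule(T, beta_start=0.1, beta_end=0.9):
    """Hypothetical linear schedule with a few large steps (not the paper's)."""
    # Index 0 is padding so that betas[t] is beta_t and alpha_bar[0] == 1.
    betas = np.concatenate([[0.0], np.linspace(beta_start, beta_end, T)])
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def posterior_sample(x_t, x0_pred, t, betas, alpha_bar, rng):
    """Draw x_{t-1} from the Gaussian posterior q(x_{t-1} | x_t, x0)."""
    beta_t, ab_t, ab_prev = betas[t], alpha_bar[t], alpha_bar[t - 1]
    coef_x0 = np.sqrt(ab_prev) * beta_t / (1.0 - ab_t)
    coef_xt = np.sqrt(1.0 - beta_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    var = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t  # zero at t == 1
    mean = coef_x0 * x0_pred + coef_xt * x_t
    return mean + np.sqrt(var) * rng.standard_normal(x_t.shape)

def toy_generator(x_t, z, t):
    """Stand-in for a trained G_theta(x_t, z, t); NOT a real model."""
    return np.tanh(x_t + 0.1 * z[:, :1])

def sample(generator, shape, T, rng, latent_dim=8):
    """Run T denoising steps, i.e. T generator evaluations (NFE = T)."""
    betas, alpha_bar = make_schedule(T)
    x = rng.standard_normal(shape)  # start from x_T ~ N(0, I)
    for t in range(T, 0, -1):
        z = rng.standard_normal((shape[0], latent_dim))  # fresh latent each step
        x0_pred = generator(x, z, t)  # implicit, stochastic x0 prediction
        x = posterior_sample(x, x0_pred, t, betas, alpha_bar, rng)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(sample(toy_generator, (5, 3), T=4, rng=rng).shape)
```

Because a fresh z is drawn at every step, the same x_t can map to different x0 predictions, which is what makes the sampled x_{t−1} multimodal even though the posterior itself is Gaussian. At t = 1 the posterior variance is zero, so the last step returns the x0 prediction directly.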

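Experimental Results

Research Questions

RQ1: Can a multimodal denoising distribution at each diffusion step reduce the number of denoising steps without sacrificing sample quality?
RQ2: Does modeling the denoising step with a conditional GAN improve mode coverage and diversity compared with a Gaussian denoiser?
RQ3: How does the proposed method compare with standard diffusion models and GANs in fidelity, diversity, and sampling speed?
RQ4: Are the speed gains maintained when extending to high-resolution data and editing tasks (e.g., stroke-based synthesis)?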

Key Findings

Model	IS ↑	FID ↓	Recall ↑	NFE ↓	Time (s) ↓
Denoising Diffusion GAN (ours), T=4	9.63	3.75	0.57	4	0.21
DDPM (Ho et al., 2020)	9.46	3.21	0.57	1000	80.5
NCSN (Song & Ermon, 2019)	8.87	25.3	-	1000	107.9
Adversarial DSM (Jolicoeur-Martineau et al., 2021b)	-	6.10	-	1000	-
Likelihood SDE (Song et al., 2021b)	-	2.87	-	-	-
Score SDE (VE) (Song et al., 2021c)	9.89	2.20	0.59	2000	423.2
Score SDE (VP) (Song et al., 2021c)	9.68	2.41	0.59	2000	421.5
Probability Flow (VP) (Song et al., 2021c)	9.83	3.08	0.57	140	50.9
LSGM (Vahdat et al., 2021)	9.87	2.10	0.61	147	44.5
DDIM, T=50 (Song et al., 2021a)	8.78	4.67	0.53	50	4.01
FastDDPM, T=50 (Kong & Ping, 2021)	8.98	3.41	0.56	50	4.01
Recovery EBM (Gao et al., 2021)	8.30	9.58	-	180	-
Improved DDPM (Nichol & Dhariwal, 2021)	-	2.90	-	4000	-
VDM (Kingma et al., 2021)	-	4.00	-	1000	-
UDM (Kim et al., 2021)	10.1	2.33	-	2000	-
D3PMs (Austin et al., 2021)	8.56	7.34	-	1000	-
Gotta Go Fast (Jolicoeur-Martineau et al., 2021a)	-	2.44	-	180	-
DDPM Distillation (Luhman & Luhman, 2021)	8.36	9.36	0.51	1	-
SNGAN (Miyato et al., 2018)	8.22	21.7	0.44	1	-
SNGAN+DGflow (Ansari et al., 2021)	9.35	9.62	0.48	25	1.98
AutoGAN (Gong et al., 2019)	8.60	12.4	0.46	1	-
TransGAN (Jiang et al., 2021)	9.02	9.26	-	1	-
StyleGAN2 w/o ADA (Karras et al., 2020a)	9.18	8.32	0.41	1	0.04
StyleGAN2 w/ ADA (Karras et al., 2020a)	9.83	2.92	0.49	1	0.04
StyleGAN2 w/ Diffaug (Zhao et al., 2020)	9.40	5.79	0.42	1	0.04
Glow (Kingma & Dhariwal, 2018)	3.92	48.9	-	1	-
PixelCNN (Oord et al., 2016b)	4.60	65.9	-	1024	-
NVAE (Vahdat & Kautz, 2020)	7.18	23.5	0.51	1	0.36

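Denoising Diffusion GANs achieve sample quality and diversity competitive with diffusion models using only 2–4 denoising steps.
On CIFAR-10, the method reaches IS 9.63 and FID 3.75 with NFE = 4 and a sampling time of 0.21 s, far faster than prior diffusion methods.
The model attains higher recall (0.57) than many GAN variants, indicating improved mode coverage.
On CIFAR-10 it is about 2000× faster than predictor-corrector diffusion (Song et al., 2021c) and about 20× faster than FastDDPM.
The latent variable is important for multimodality: removing z lowers sample quality and recall.
Mode coverage experiments (25-Gaussians, StackedMNIST) show full mode coverage and low KL divergence, outperforming several GAN and diffusion-based baselines.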
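
The quoted speedups can be checked against the sampling times in the table. The short script below is only an arithmetic check; the dictionary and the helper `speedup` are names introduced here, and the times are copied from the table's time column.

```python
# Sampling times in seconds, copied from the results table.
times = {
    "Denoising Diffusion GAN, T=4": 0.21,
    "DDPM": 80.5,
    "Score SDE (VE)": 423.2,
    "Score SDE (VP)": 421.5,
    "FastDDPM, T=50": 4.01,
}

def speedup(baseline, ours="Denoising Diffusion GAN, T=4"):
    """Ratio of the baseline's sampling time to the proposed model's."""
    return times[baseline] / times[ours]

for name in times:
    if name != "Denoising Diffusion GAN, T=4":
        print(f"{name}: {speedup(name):.1f}x")
```

The ratios come out to roughly 2015 for Score SDE (VE) and roughly 19 for FastDDPM, matching the "about 2000×" and "about 20×" figures, and roughly 383 for DDPM.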

This review was generated by AI and checked by a human editor.