QUICK REVIEW

[Paper Review] Tackling the Generative Learning Trilemma with Denoising Diffusion GANs

Zhisheng Xiao, Karsten Kreis|arXiv (Cornell University)|Dec 15, 2021

Generative Adversarial Networks and Image Synthesis | Cited by 143

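One-Sentence Summary

This paper proposes denoising diffusion GANs, which model each multimodal denoising step with a conditional GAN, greatly accelerating diffusion sampling while preserving sample quality and diversity, and thereby addressing the generative learning trilemma.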

ABSTRACT

A wide variety of deep generative models has been developed in the past decade. Yet, these models often struggle with simultaneously addressing three key requirements including: high sample quality, mode coverage, and fast sampling. We call the challenge imposed by these requirements the generative learning trilemma, as the existing models often trade some of them for others. Particularly, denoising diffusion models have shown impressive sample quality and diversity, but their expensive sampling does not yet allow them to be applied in many real-world applications. In this paper, we argue that slow sampling in these models is fundamentally attributed to the Gaussian assumption in the denoising step which is justified only for small step sizes. To enable denoising with large steps, and hence, to reduce the total number of denoising steps, we propose to model the denoising distribution using a complex multimodal distribution. We introduce denoising diffusion generative adversarial networks (denoising diffusion GANs) that model each denoising step using a multimodal conditional GAN. Through extensive evaluations, we show that denoising diffusion GANs obtain sample quality and diversity competitive with original diffusion models while being 2000$\times$ faster on the CIFAR-10 dataset. Compared to traditional GANs, our model exhibits better mode coverage and sample diversity. To the best of our knowledge, denoising diffusion GAN is the first model that reduces sampling cost in diffusion models to an extent that allows them to be applied to real-world applications inexpensively. Project page and code can be found at https://nvlabs.github.io/denoising-diffusion-gan

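Motivation and Objectives

Motivate the generative learning trilemma: high sample quality, mode coverage, and fast sampling.
Develop a diffusion-based model that maintains quality and diversity with far fewer denoising steps.
Show that a multimodal denoising distribution enables fast, realistic image generation and editing.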

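Proposed Method

Formulate a forward diffusion that adds noise to the data, but use only a small number of large denoising steps (T ≤ 8).
Model the denoising distribution q(x_{t−1}|x_t) with a multimodal conditional GAN p_θ(x_{t−1}|x_t).
Use a time-conditioned, GAN-based denoiser together with a discriminator D_φ, matching q and p_θ at every step through an adversarial loss: min_θ Σ_t E_{q(x_t)}[D_adv(q(x_{t−1}|x_t) || p_θ(x_{t−1}|x_t))].
Parameterize p_θ(x_{t−1}|x_t) through an implicit x0 predictor G_θ(x_t, z, t) and the Gaussian posterior q(x_{t−1}|x_t, x0), which makes the resulting distribution multimodal.
Introduce a latent variable z to induce multimodality and improve mode coverage and diversity.
Use an MM-compatible training pipeline with a DDPM-like backbone, but with a stochastic, multimodal x0 predictor.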
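
The sampling loop implied by this parameterization can be sketched as follows. This is a minimal illustration and not the authors' code: `toy_generator` is a hypothetical stand-in for a trained G_θ(x_t, z, t), the linear beta schedule in `make_schedule` is an assumed placeholder, and z is given the same shape as x only for simplicity. The Gaussian posterior q(x_{t−1}|x_t, x0) uses the standard DDPM closed form.

```python
import numpy as np

def make_schedule(T, beta_min=0.1, beta_max=0.9):
    """Assumed linear beta schedule with a few large steps (illustrative only)."""
    betas = np.linspace(beta_min, beta_max, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def posterior_sample(x0_pred, x_t, t, betas, alphas, alpha_bars, rng):
    """Draw x_{t-1} ~ q(x_{t-1} | x_t, x0) for a 1-indexed step t in 1..T."""
    i = t - 1
    ab_t = alpha_bars[i]
    ab_prev = alpha_bars[i - 1] if i > 0 else 1.0  # alpha_bar_0 = 1
    coef_x0 = np.sqrt(ab_prev) * betas[i] / (1.0 - ab_t)
    coef_xt = np.sqrt(alphas[i]) * (1.0 - ab_prev) / (1.0 - ab_t)
    mean = coef_x0 * x0_pred + coef_xt * x_t
    var = betas[i] * (1.0 - ab_prev) / (1.0 - ab_t)  # zero at t = 1
    return mean + np.sqrt(var) * rng.standard_normal(x_t.shape)

def toy_generator(x_t, z, t):
    """Hypothetical stand-in for G_theta(x_t, z, t); a real model is a trained network."""
    return np.tanh(x_t + 0.1 * z)

def sample(shape, T=4, seed=0):
    """Run T denoising steps: predict x0 from (x_t, z, t), then sample the posterior."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(T)
    x = rng.standard_normal(shape)  # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        z = rng.standard_normal(shape)  # fresh latent at every step
        x0_pred = toy_generator(x, z, t)
        x = posterior_sample(x0_pred, x, t, betas, alphas, alpha_bars, rng)
    return x

if __name__ == "__main__":
    print(sample((2, 3)).shape)
```

Because a fresh z is drawn at every step, the same x_t can map to different x0 predictions, which is what lets p_θ(x_{t−1}|x_t) be multimodal even though the posterior itself is Gaussian. With T=4 the loop calls the generator four times, matching the NFE of 4 reported for the T=4 model.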

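Experimental Results

Research Questions

RQ1: Can a multimodal denoising distribution at each diffusion step reduce the number of denoising steps without sacrificing sample quality?
RQ2: Does modeling the denoising step with a conditional GAN improve mode coverage and diversity over a Gaussian denoiser?
RQ3: How does the method compare with standard diffusion models and GANs in fidelity, diversity, and sampling speed?
RQ4: Does the method scale to higher-resolution data and editing tasks (such as stroke-based synthesis) while retaining its speed advantage?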

Key Findings

Model	IS ↑	FID ↓	Recall ↑	NFE ↓	Time (s) ↓
Denoising Diffusion GAN (ours), T=4	9.63	3.75	0.57	4	0.21
DDPM (Ho et al., 2020)	9.46	3.21	0.57	1000	80.5
NCSN (Song & Ermon, 2019)	8.87	25.3	-	1000	107.9
Adversarial DSM (Jolicoeur-Martineau et al., 2021b)	-	6.10	-	1000	-
Likelihood SDE (Song et al., 2021b)	-	2.87	-	-	-
Score SDE (VE) (Song et al., 2021c)	9.89	2.20	0.59	2000	423.2
Score SDE (VP) (Song et al., 2021c)	9.68	2.41	0.59	2000	421.5
Probability Flow (VP) (Song et al., 2021c)	9.83	3.08	0.57	140	50.9
LSGM (Vahdat et al., 2021)	9.87	2.10	0.61	147	44.5
DDIM, T=50 (Song et al., 2021a)	8.78	4.67	0.53	50	4.01
FastDDPM, T=50 (Kong & Ping, 2021)	8.98	3.41	0.56	50	4.01
Recovery EBM (Gao et al., 2021)	8.30	9.58	-	180	-
Improved DDPM (Nichol & Dhariwal, 2021)	-	2.90	-	4000	-
VDM (Kingma et al., 2021)	-	4.00	-	1000	-
UDM (Kim et al., 2021)	10.1	2.33	-	2000	-
D3PMs (Austin et al., 2021)	8.56	7.34	-	1000	-
Gotta Go Fast (Jolicoeur-Martineau et al., 2021a)	-	2.44	-	180	-
DDPM Distillation (Luhman & Luhman, 2021)	8.36	9.36	0.51	1	-
SNGAN (Miyato et al., 2018)	8.22	21.7	0.44	1	-
SNGAN+DGflow (Ansari et al., 2021)	9.35	9.62	0.48	25	1.98
AutoGAN (Gong et al., 2019)	8.60	12.4	0.46	1	-
TransGAN (Jiang et al., 2021)	9.02	9.26	-	1	-
StyleGAN2 w/o ADA (Karras et al., 2020a)	9.18	8.32	0.41	1	0.04
StyleGAN2 w/ ADA (Karras et al., 2020a)	9.83	2.92	0.49	1	0.04
StyleGAN2 w/ Diffaug (Zhao et al., 2020)	9.40	5.79	0.42	1	0.04
Glow (Kingma & Dhariwal, 2018)	3.92	48.9	-	1	-
PixelCNN (Oord et al., 2016b)	4.60	65.9	-	1024	-
NVAE (Vahdat & Kautz, 2020)	7.18	23.5	0.51	1	0.36

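With only 2–4 denoising steps, denoising diffusion GANs achieve sample quality and diversity comparable to diffusion models.
On CIFAR-10, the method reaches IS = 9.63 and FID = 3.75 with NFE = 4 and a sampling time of 0.21 s, far faster than earlier diffusion methods.
The model attains a higher recall (0.57) than the GAN variants in the table that report recall (0.41 to 0.49), indicating improved mode coverage.
Compared with predictor-corrector diffusion (Song et al., 2021c), sampling on CIFAR-10 is about 2000× faster; compared with FastDDPM, it is about 20× faster.
The latent variable is important for multimodality; removing z degrades sample quality and recall.
Mode-coverage experiments (25-Gaussians, StackedMNIST) show full mode coverage with low KL divergence, outperforming several GAN and diffusion baselines.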
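
As a quick sanity check, the speedup figures can be recomputed from the sampling times in the table. This is only an arithmetic sketch: the times are copied from the table, and taking the Score SDE (VP) row as the predictor-corrector baseline is an assumption made here, since the text does not say which row the 2000× figure refers to.

```python
# Sampling times in seconds, copied from the table above.
times = {
    "Denoising Diffusion GAN, T=4": 0.21,
    "DDPM": 80.5,
    "Score SDE (VP)": 421.5,  # assumed predictor-corrector baseline
    "FastDDPM, T=50": 4.01,
}

def speedup(baseline, ours="Denoising Diffusion GAN, T=4"):
    """Ratio of the baseline's sampling time to that of the proposed model."""
    return times[baseline] / times[ours]

for name in ("Score SDE (VP)", "DDPM", "FastDDPM, T=50"):
    print(f"{name}: {speedup(name):.0f}x")
# Score SDE (VP): 2007x
# DDPM: 383x
# FastDDPM, T=50: 19x
```

These ratios are consistent with the rounded "about 2000×" and "about 20×" figures quoted above.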


This review was generated by AI and reviewed by human editors.