QUICK REVIEW

[Paper Digest] Improved Techniques for Training Score-Based Generative Models

Yang Song, Stefano Ermon | arXiv (Cornell University) | Jun 16, 2020

Generative Adversarial Networks and Image Synthesis | References: 32 | Cited by: 142

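One-Sentence Summary

This paper analyzes how to scale score-based generative models to high-resolution images and introduces techniques (noise scale selection, noise conditioning, EMA) that produce high-fidelity samples comparable to those of GANs on images from 64×64 to 256×256.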

ABSTRACT

Score-based generative models can produce high quality image samples comparable to GANs, without requiring adversarial optimization. However, existing training procedures are limited to images of low resolution (typically below 32x32), and can be unstable under some settings. We provide a new theoretical analysis of learning and sampling from score models in high dimensional spaces, explaining existing failure modes and motivating new solutions that generalize across datasets. To enhance stability, we also propose to maintain an exponential moving average of model weights. With these improvements, we can effortlessly scale score-based generative models to images with unprecedented resolutions ranging from 64x64 to 256x256. Our score-based models can generate high-fidelity samples that rival best-in-class GANs on various image datasets, including CelebA, FFHQ, and multiple LSUN categories.

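Motivation and Goals

Explain why earlier score-based models were limited on high-resolution images.
Propose theoretically grounded methods for choosing noise scales and sampling parameters.
Propose architectural and training techniques that improve stability and sample quality.
Demonstrate scalability to 64×64 to 256×256 images across diverse datasets.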

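Proposed Methods

Analytical guidance for choosing Gaussian noise scales based on the data distribution.
Amortized score estimation across many noise scales with a single network (noise conditioning).
Theoretical analysis of Langevin dynamics to tune sampling performance across noise scales.
An exponential moving average (EMA) of model parameters, maintained during training and used at sampling time, to improve stability.
A denoising step (inspired by Tweedie's formula) to improve final sample quality.
Integration of the above into an end-to-end training and sampling recipe (NCSNv2).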
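
The sampling side of the methods listed above can be sketched in a few lines. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: `toy_score` is a hypothetical stand-in for a trained noise-conditional score network (it is the exact score of 1-D Gaussian data, with `mu` and `s0` chosen only for illustration), and the noise schedule, `eps`, and step counts are illustrative values rather than settings from the paper. The loop uses one common form of annealed Langevin dynamics, x ← x + α·s(x, σ) + √(2α)·z with α = ε·σ²/σ_L², followed by a single denoising step x + σ_L²·s(x, σ_L).

```python
import numpy as np


def toy_score(x, sigma, mu=2.0, s0=0.5):
    """Hypothetical stand-in for a trained score network.

    Exact score of data ~ N(mu, s0^2) after adding N(0, sigma^2) noise.
    """
    return -(x - mu) / (s0**2 + sigma**2)


def annealed_langevin(score_fn, sigmas, n_samples=2000, steps_per_scale=100,
                      eps=2e-3, denoise=True, seed=0):
    """Annealed Langevin dynamics over decreasing noise scales `sigmas`."""
    rng = np.random.default_rng(seed)
    sigmas = np.asarray(sigmas, dtype=float)
    sigma_last = sigmas[-1]
    # Start from broad noise matched to the largest scale.
    x = rng.normal(0.0, sigmas[0], size=n_samples)
    for sigma in sigmas:
        # The step size shrinks together with the noise scale.
        alpha = eps * sigma**2 / sigma_last**2
        for _ in range(steps_per_scale):
            z = rng.standard_normal(n_samples)
            x = x + alpha * score_fn(x, sigma) + np.sqrt(2.0 * alpha) * z
    if denoise:
        # One noise-free step toward the posterior mean (Tweedie-style).
        x = x + sigma_last**2 * score_fn(x, sigma_last)
    return x


sigmas = np.geomspace(5.0, 0.1, 10)  # illustrative schedule, largest first
samples = annealed_langevin(toy_score, sigmas)
print("sample mean:", samples.mean(), "sample std:", samples.std())
```

In this toy setting the samples should concentrate around `mu`; calling the function with `denoise=False` leaves the residual noise of the smallest scale in the output, which is the effect the denoising step is meant to remove.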

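Experimental Results

Research Questions

RQ1: How can score-based models be scaled from 32×32 to high-resolution images (64×64 to 256×256)?
RQ2: Which noise scale configuration and conditioning method enable reliable learning and fast, high-quality sampling?
RQ3: Can an exponential moving average of the parameters stabilize training and improve sample fidelity?
RQ4: Can a single amortized network handle many noise scales effectively?
RQ5: How much do the standard metrics (FID, Inception score) improve when these techniques are applied across datasets?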

Key Findings

Model	Inception ↑	FID ↓
Unconditional PixelCNN [17]	4.60	65.93
IGEBM [18]	6.02	40.58
WGAN-GP [19]	7.86±.07	36.4
SNGAN [20]	8.22±.05	21.7
NCSN [1]	8.87±.12	25.32
NCSN (w/ denoising)	N/A	29.8
NCSNv2 (w/o denoising)	8.73±.13	31.75
NCSNv2 (w/ denoising)	8.40±.07	10.87
CelebA 64×64: NCSN (w/o denoising)	−	26.89
CelebA 64×64: NCSN (w/ denoising)	−	25.30
CelebA 64×64: NCSNv2 (w/o denoising)	−	28.86
CelebA 64×64: NCSNv2 (w/ denoising)	−	10.23

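NCSNv2 produces high-fidelity samples on 64×64 CelebA and on 128×128 to 256×256 LSUN/FFHQ, outperforming earlier score-based models.
The initial (largest) noise scale should be as large as the maximum pairwise distance in the training data, which promotes sample diversity.
Choosing the noise scales as a geometric progression with a specific common ratio covers the high-density regions and keeps training stable.
Rescaling the output of an unconditional score network by 1/σ incorporates the noise level and improves training across scales.
Choosing the number of sampling steps and the step size through data-driven analysis reduces tuning and improves mixing.
Sampling with an exponential moving average of the model parameters significantly stabilizes FID and reduces artifacts.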
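
The noise-schedule and EMA findings above translate into small helper functions. The sketch below is illustrative only: all function names are hypothetical, the toy data and the smallest noise scale `0.01` are assumed values, and `decay=0.999` is just a typical EMA rate. The `overlap_criterion` function encodes the ratio check Φ(√(2D)(γ−1)+3γ) − Φ(√(2D)(γ−1)−3γ) as recalled from the paper (which targets a value near 0.5); this digest does not state the formula, so verify it against the paper before relying on it.

```python
import math

import numpy as np


def max_pairwise_distance(data):
    """Largest Euclidean distance between any two samples (O(N^2) memory)."""
    flat = np.asarray(data, dtype=np.float64).reshape(len(data), -1)
    sq = (flat**2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * flat @ flat.T
    return float(np.sqrt(np.clip(d2, 0.0, None).max()))


def geometric_sigmas(sigma_max, sigma_min, num_scales):
    """Noise scales in geometric progression, largest first."""
    return np.geomspace(sigma_max, sigma_min, num_scales)


def overlap_criterion(gamma, dim):
    """Overlap between neighbouring noise scales for common ratio `gamma`."""
    def phi(t):  # standard normal CDF
        return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    shift = math.sqrt(2.0 * dim) * (gamma - 1.0)
    return phi(shift + 3.0 * gamma) - phi(shift - 3.0 * gamma)


def ema_update(ema_params, params, decay=0.999):
    """Return new EMA weights: decay * ema + (1 - decay) * current."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k]
            for k in params}


rng = np.random.default_rng(0)
toy_data = rng.uniform(0.0, 1.0, size=(256, 8, 8))  # fake 8x8 "images"
sigma_1 = max_pairwise_distance(toy_data)
sigmas = geometric_sigmas(sigma_1, 0.01, num_scales=20)
gamma = sigmas[0] / sigmas[1]
print("sigma_1:", sigma_1, "gamma:", gamma,
      "criterion:", overlap_criterion(gamma, dim=64))
```

In practice one would adjust `num_scales` until the criterion lands near the target, and call `ema_update` after every optimizer step, keeping the EMA copy for sampling only.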

This digest was generated by AI and reviewed by a human editor.