QUICK REVIEW

[Paper Review] Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed

Eric Luhman, Troy Luhman|arXiv (Cornell University)|Jan 7, 2021

Generative Adversarial Networks and Image Synthesis|References: 40|Cited by: 62

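One-sentence summary

This paper proposes a knowledge distillation method that compresses the multi-step, deterministic DDIM sampling process into a single-step denoising student model, achieving GAN-like sampling speed with high-quality samples on CIFAR-10, CelebA, and LSUN.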

ABSTRACT

Iterative generative models, such as noise conditional score networks and denoising diffusion probabilistic models, produce high quality samples by gradually denoising an initial noise vector. However, their denoising process has many steps, making them 2-3 orders of magnitude slower than other generative models such as GANs and VAEs. In this paper, we establish a novel connection between knowledge distillation and image generation with a technique that distills a multi-step denoising process into a single step, resulting in a sampling speed similar to other single-step generative models. Our Denoising Student generates high quality samples comparable to GANs on the CIFAR-10 and CelebA datasets, without adversarial training. We demonstrate that our method scales to higher resolutions through experiments on 256 x 256 LSUN. Code and checkpoints are available at https://github.com/tcl9876/Denoising_Student

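Motivation and objectives

Speed up sampling from iterative generative models (such as DDPMs and NCSNs) by reducing the number of denoising steps.
Propose a knowledge distillation framework in which a fast student model learns to match the teacher's DDIM output.
Provide a simple, non-adversarial objective, so that distillation requires no change to the architecture or training dynamics.
Demonstrate scalability to higher resolutions (e.g., 256x256 LSUN) while retaining a meaningful latent representation.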

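Proposed method

Model the teacher as a DDIM with a deterministic, multi-step generative process.
Define a student that outputs a Gaussian approximation of p(x0|xT), with a trainable mean F_student(xT) and unit variance.
Train the student by minimizing KL(p_teacher(x0|xT) || p_student(x0|xT)), which simplifies to a regression loss between F_student(xT) and the teacher output F_teacher(xT).
Initialize the student with the same architecture and weights as the teacher's noise predictor to transfer knowledge.
Add Gaussian noise to the teacher output so that its output distribution is non-degenerate (has non-zero density) during training.
Condition both the teacher and the student on timestep T (the highest noise level).
Exploit the determinism of DDIM so that sampling from xT to x0 can be done by the student in a single evaluation.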
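The steps above can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the authors' implementation: the data distribution is assumed to be N(0, DATA_STD^2 I) so that the noise predictor has a closed form, the beta schedule is an assumed linear one, and the student is a hypothetical one-parameter model F_student(xT) = w * xT (in the paper the student shares the teacher's architecture and starts from its weights). The loss uses the fact that the KL divergence between two unit-variance Gaussians equals half the squared distance between their means, so the added noise never has to be sampled explicitly.

```python
import numpy as np

DATA_STD = 2.0  # assumption: toy data distribution x0 ~ N(0, DATA_STD^2 I)

def make_alpha_bars(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)

def toy_eps_model(x_t, alpha_bar_t):
    """Closed-form E[eps | x_t] for the toy Gaussian data (stands in for a trained network)."""
    var_t = alpha_bar_t * DATA_STD**2 + (1.0 - alpha_bar_t)
    return np.sqrt(1.0 - alpha_bar_t) * x_t / var_t

def ddim_teacher(x_T, alpha_bars, num_steps=100):
    """Deterministic DDIM sampling: maps x_T to x_0 using num_steps model evaluations."""
    T = len(alpha_bars)
    timesteps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(timesteps):
        a_t = alpha_bars[t]
        # After the last step the sample is fully denoised (alpha_bar = 1).
        a_prev = alpha_bars[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        eps = toy_eps_model(x, a_t)
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x

def distill_student(alpha_bars, steps=200, batch=256, dim=8, lr=0.1, seed=0):
    """Fit the toy student F_student(x_T) = w * x_T to the teacher by regression."""
    rng = np.random.default_rng(seed)
    w = 1.0  # hypothetical scalar student parameter
    for _ in range(steps):
        x_T = rng.standard_normal((batch, dim))
        target = ddim_teacher(x_T, alpha_bars)  # F_teacher(x_T), 100 evaluations
        residual = w * x_T - target             # F_student(x_T) - F_teacher(x_T)
        grad = np.mean(residual * x_T)          # d/dw of 0.5 * mean(residual**2)
        w -= lr * grad
    return w

if __name__ == "__main__":
    alpha_bars = make_alpha_bars()
    w = distill_student(alpha_bars)
    print("student scale:", w)  # one multiplication now replaces 100 teacher steps
```

In this toy setting the teacher is a linear map, so a single scalar is enough for the student to reproduce it; with real images the same regression target would be fitted by a full network evaluated once per sample.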

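Experimental results

Research questions

RQ1: Can knowledge distillation transfer a multi-step DDIM denoising process to a single-step model without adversarial training?
RQ2: How does the single-step Denoising Student perform in sample quality (FID/IS) and speed on standard benchmarks (CIFAR-10, CelebA) and on a higher-resolution dataset (LSUN 256x256)?
RQ3: Does the distilled model preserve latent-space structure and support meaningful interpolation?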

Key findings

Model	FID ↓	IS ↑	Steps ↓
Denoising Student (Ours)	9.36	8.36	1
NVAE [38]	51.67	5.51	1
MoLM [25]	18.9	7.90	1
SNGAN [23]	21.7	8.22	1
BigGAN (cond.) [1]	14.73	9.22	1
PPOGAN [41]	10.87	8.69	1
StyleGAN2+ADA [16]	2.92	9.83	1
StyleGAN2+ADA (cond.) [16]	2.42	10.14	1
DDIM (100 step, Teacher)	4.16	8.96*	100
EBM [5]	38.2	6.78	60
VAEBM [42]	12.19	8.43	16
EBM+recovery likelihood [8]	9.60	8.58	180
NCSNv2 [32]	10.87	8.40	1160
DDPM [13]	3.17	9.46	1000
NCSN++ (8 blocks/res) [33]	2.20	9.89	2000

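Denoising Student achieves a competitive FID and IS on CIFAR-10 in a single step (FID 9.36, IS 8.36).
On CelebA, the model achieves a competitive FID of 10.68 (IS not reported).
On higher-resolution LSUN (256x256), the model produces coherent structure and color, but some textures remain blurry because the objective reproduces the teacher's output pixel by pixel.
Sampling is much faster: roughly 100x faster than the teacher and 1000x faster than DDPM (CIFAR-10), with 50k images generated in 51.5 seconds.
The method scales to 256x256 LSUN images and learns a latent representation that supports meaningful interpolation, as shown by the spherical interpolation results.
The method does not rely on adversarial training and retains the latent-space manipulation capability of implicit models.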
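The spherical interpolation mentioned above can be sketched as follows. This is a generic slerp between two latent noise vectors, written as an illustration rather than taken from the paper's code; the function name `slerp` and the fallback to linear interpolation for nearly parallel vectors are choices made here.

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical linear interpolation between latents z0 and z1 at fraction t in [0, 1]."""
    z0_flat, z1_flat = z0.ravel(), z1.ravel()
    cos_omega = np.dot(z0_flat, z1_flat) / (
        np.linalg.norm(z0_flat) * np.linalg.norm(z1_flat)
    )
    # Clip to guard against round-off pushing the cosine outside [-1, 1].
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(np.sin(omega), 0.0):
        # Nearly parallel vectors: fall back to linear interpolation.
        return (1.0 - t) * z0 + t * z1
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_a = rng.standard_normal((3, 32, 32))  # two latent noise images x_T
    z_b = rng.standard_normal((3, 32, 32))
    path = [slerp(z_a, z_b, t) for t in np.linspace(0.0, 1.0, 8)]
    print(len(path), path[0].shape)
```

Each interpolated latent would then be passed through the student once to obtain an image. Slerp is commonly preferred over linear interpolation for Gaussian latents because it keeps intermediate points at a norm similar to that of the endpoints.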


This review was generated by AI and checked by a human editor.