QUICK REVIEW

[Paper Review] TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation

David Berthelot, Arnaud Autef|arXiv (Cornell University)|Mar 7, 2023

Advanced Neuroimaging Techniques and Applications | Cited by 12

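One-Sentence Summary

TRACT uses transitive closure time-distillation to substantially improve single-step and few-step diffusion sampling, achieving state-of-the-art single-step DDIM FID on CIFAR-10 and 64×64 ImageNet without changing the architecture.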

ABSTRACT

Denoising Diffusion models have demonstrated their proficiency for generative sampling. However, generating good samples often requires many iterations. Consequently, techniques such as binary time-distillation (BTD) have been proposed to reduce the number of network calls for a fixed architecture. In this paper, we introduce TRAnsitive Closure Time-distillation (TRACT), a new method that extends BTD. For single step diffusion, TRACT improves FID by up to 2.4x on the same architecture, and achieves new single-step Denoising Diffusion Implicit Models (DDIM) state-of-the-art FID (7.4 for ImageNet64, 3.8 for CIFAR10). Finally we tease apart the method through extended ablations. The PyTorch implementation will be released soon.

Motivation and Objectives

Reduce the inference cost of diffusion models by enabling single-step or few-step sampling without architecture changes.
Identify limitations of binary time-distillation (BTD) such as objective degeneracy and SWA incompatibility.
Propose TRACT to distill outputs across time steps via transitive closure with self-teaching to maintain quality with few phases.
Show that TRACT achieves state-of-the-art or competitive FID with 1-2 steps on CIFAR-10 and 64×64 ImageNet and analyze ablations.

Proposed Method

Extend binary time-distillation (BTD) to Transitive Closure Time-Distillation (TRACT), reducing the number of distillation phases from $\log_2(T)$ to a small constant (1–2).
Train a student to distill the teacher’s inference from t to t' where t' < t using a self-teacher EMA to perform transitive closure (equations 6–9).
Use a self-teaching EMA of the student weights to generate targets for multi-step jumps (Algorithm 1).
Adapt TRACT to VE/EDM settings with RK and DDIM-VE teachers and derive corresponding targets and losses (equations 11–15).
Mitigate objective degeneracy by limiting distillation phases and leveraging self-teaching with EMA and inference-time EMA.
Provide training details including group-based distillation, loss weighting, and EMA updates (Appendix references).
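
The self-teaching loop described in the list above can be illustrated with a toy sketch. This is not the authors' implementation: the scalar `teacher_step`, the one-weight-per-time-step student, the function name `train_group`, and all hyperparameters are assumptions chosen so the example runs in pure Python. The student at time t is trained to jump directly to the start of the group t_i. Its target is one teacher step from t to t-1, followed by the EMA self-teacher's jump from t-1 to t_i.

```python
import random

def teacher_step(x, t, decay=0.9):
    """Hypothetical teacher: one deterministic sampler step from t to t-1.
    A real teacher would be a diffusion network; here it just scales x."""
    return decay * x

def train_group(t_i, t_j, steps=20000, lr=0.05, ema_momentum=0.99, seed=0):
    """Toy transitive-closure distillation over the group {t_i, ..., t_j}."""
    rng = random.Random(seed)
    # One scalar weight per time step: student(x, t) = student[t] * x,
    # meant to approximate the state at t_i given the state at t.
    student = {t: 1.0 for t in range(t_i + 1, t_j + 1)}
    self_teacher = dict(student)  # EMA copy of the student weights

    for _ in range(steps):
        t = rng.randint(t_i + 1, t_j)      # random time step in the group
        x_t = rng.uniform(-1.0, 1.0)       # stand-in for a noisy sample
        x_prev = teacher_step(x_t, t)      # teacher: t -> t-1
        if t - 1 == t_i:
            target = x_prev                # already at the group start
        else:
            target = self_teacher[t - 1] * x_prev  # self-teacher: t-1 -> t_i

        # Gradient step on the squared error (student[t] * x_t - target)^2.
        grad = 2.0 * (student[t] * x_t - target) * x_t
        student[t] -= lr * grad

        # EMA update of the self-teacher toward the current student.
        for s in student:
            self_teacher[s] = (ema_momentum * self_teacher[s]
                               + (1.0 - ema_momentum) * student[s])
    return student, self_teacher

student, _ = train_group(t_i=0, t_j=4)
for t in sorted(student):
    print(t, round(student[t], 4), round(0.9 ** t, 4))
```

In this toy setting the learned weight for each t approaches the composition of t teacher steps, which is the sense in which the student covers the whole group in a single jump.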

Figure 1 : Transitive Closure Distillation of a group $\{t_{i},\ldots,t_{j}\}$ .

Experimental Results

Research Questions

RQ1: Can TRACT achieve high-quality samples with 1–2 inference steps without architectural changes?
RQ2: Does reducing the number of distillation phases mitigate objective degeneracy and enable effective SWA?
RQ3: How does TRACT perform with VE/EDM teachers and alternative samplers (RK, DDIM-VE) across CIFAR-10 and 64×64 ImageNet?

Key Findings

Method	NFE	FID	Parameters
TRACT-EDM-256M ∗	1	3.78 ± 0.01	56M
TRACT-96M ∗	1	4.17 ± 0.03	56M
TRACT-256M	1	4.45 ± 0.05	60M
BTD-96M [44]	-	9.12	60M
TRACT-96M	2	3.32 ± 0.02	60M
TRACT-EDM-256M ∗	2	3.55 ± 0.01	56M
TRACT-EDM-96M ∗	2	3.75 ± 0.02	56M
BTD-96M [44]	-	4.51	60M
TRACT-96M	1	7.43 ± 0.07	296M
TRACT-EDM-96M ∗	1	7.52 ± 0.05	296M

1-step TRACT improves CIFAR-10 FID from 9.1 (BTD) to 4.5 with the same architecture and budget.
1-step TRACT achieves 7.4 FID on 64×64 ImageNet (7.43 for TRACT-96M, 7.52 for TRACT-EDM-96M), improving over BTD baselines.
2-step TRACT reaches 3.32 FID on CIFAR-10, and 1-step TRACT reaches 7.43 FID on 64×64 ImageNet.
TRACT-EDM-256M achieves 3.78±0.01 FID with 1 NFE on CIFAR-10; TRACT-EDM-96M achieves 3.75±0.02 FID with 2 NFE on CIFAR-10 (Table 1 and related text).
On 64×64 ImageNet, TRACT-96M achieves 7.43±0.07 FID with 1 NFE and TRACT-EDM-96M achieves 7.52±0.05 with the same setup (Table 2 and related text).
Ablations show best performance with a 2-phase schedule (1024→32→1) and EMA-based self-teaching; more phases degrade performance due to objective degeneracy.
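
To make the phase counts concrete, the following sketch compares a halving schedule (the $\log_2(T)$ phases attributed to BTD above) with the 2-phase schedule 1024→32→1. The helper names `btd_phase_count` and `tract_group_sizes` are hypothetical, and the assumption that each phase splits the teacher steps into equal-sized groups is ours for illustration.

```python
import math

def btd_phase_count(teacher_steps):
    """Phases needed when every phase halves the step count down to 1."""
    return int(math.log2(teacher_steps))

def tract_group_sizes(schedule):
    """Teacher steps covered by one student step in each phase,
    assuming equal-sized groups (an illustrative assumption)."""
    return [src // dst for src, dst in zip(schedule, schedule[1:])]

schedule = [1024, 32, 1]
print(btd_phase_count(schedule[0]))  # 10 halving phases for T = 1024
print(len(schedule) - 1)             # 2 phases for 1024 -> 32 -> 1
print(tract_group_sizes(schedule))   # [32, 32]
```

Under these assumptions, each of the two phases asks the student to cover 32 teacher steps in one jump, instead of chaining ten successive generations of students.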

Figure 5 : 1-step FID for 2-phases $T:1024\to 32\to 1$ TRACT distilled models. Each curve maps to a different way to set the inference time EMA momentum $\mu$ across training lengths. Dashed lines correspond to fixing a $\mu$ value, solid lines correspond to fixing $\epsilon=\mu^{N}$ .
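
The relation $\epsilon=\mu^{N}$ in the caption can be inverted to pick a momentum for a given training length. The sketch below assumes that $N$ is the number of training steps and that $\epsilon$ is the weight the EMA still assigns to its initial value after $N$ updates. Both readings are assumptions based on the caption alone, and the function names are hypothetical.

```python
def momentum_for_fixed_epsilon(epsilon, num_steps):
    """Solve epsilon = mu ** num_steps for mu."""
    return epsilon ** (1.0 / num_steps)

def residual_weight(mu, num_steps):
    """Weight an EMA with momentum mu keeps on its initial value
    after num_steps updates."""
    return mu ** num_steps

for n in (1_000, 10_000, 100_000):
    mu_fixed_eps = momentum_for_fixed_epsilon(0.01, n)
    print(
        n,
        round(mu_fixed_eps, 6),                        # mu grows with n
        round(residual_weight(mu_fixed_eps, n), 6),    # stays at epsilon
        residual_weight(0.999, n),                     # fixed mu: shrinks with n
    )
```

Fixing $\epsilon$ keeps the effective averaging window proportional to the training length, whereas fixing $\mu$ keeps the window constant no matter how long training runs.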


This review was generated by AI and reviewed by human editors.