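[Paper Summary] TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation
TRACT uses transitive closure time-distillation to substantially improve single-step and few-step diffusion sampling, reaching state-of-the-art 1-step DDIM FID on CIFAR-10 and 64×64 ImageNet without changing the architecture.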
Denoising Diffusion models have demonstrated their proficiency for generative sampling. However, generating good samples often requires many iterations. Consequently, techniques such as binary time-distillation (BTD) have been proposed to reduce the number of network calls for a fixed architecture. In this paper, we introduce TRAnsitive Closure Time-distillation (TRACT), a new method that extends BTD. For single-step diffusion, TRACT improves FID by up to 2.4x on the same architecture, and achieves new single-step Denoising Diffusion Implicit Models (DDIM) state-of-the-art FID (7.4 for ImageNet64, 3.8 for CIFAR10). Finally, we tease apart the method through extended ablations. The PyTorch implementation will be released soon.
Research Motivation and Goals
- Reduce the inference cost of diffusion models by enabling single- or few-step sampling without architecture changes.
- Identify limitations of binary time-distillation (BTD) such as objective degeneracy and SWA incompatibility.
- Propose TRACT to distill outputs across time steps via transitive closure with self-teaching to maintain quality with few phases.
- Show that TRACT achieves state-of-the-art or competitive FID with 1-2 steps on CIFAR-10 and 64×64 ImageNet and analyze ablations.
Proposed Method
- Extend binary time-distillation (BTD) to Transitive Closure Time-Distillation (TRACT) reducing distillation phases from log2(T) to a small constant (1–2).
- Train a student to distill the teacher’s inference from t to t' where t' < t using a self-teacher EMA to perform transitive closure (equations 6–9).
- Use a self-teaching EMA of the student weights to generate targets for multi-step jumps (Algorithm 1).
- Adapt TRACT to VE/EDM settings with RK and DDIM-VE teachers and derive corresponding targets and losses (equations 11–15).
- Mitigate objective degeneracy by limiting distillation phases and leveraging self-teaching with EMA and inference-time EMA.
- Provide training details including group-based distillation, loss weighting, and EMA updates (Appendix references).
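The self-teaching idea in the list above can be illustrated with a deliberately tiny toy problem. This is a minimal sketch under stated assumptions, not the paper's Algorithm 1: the "teacher" is a hypothetical linear map that moves a scalar from time t to t-1 by multiplying by `a`, and the "student" is a table `g[t]` of scalar multipliers for the direct jump from t to 0. All names here (`teacher_step`, `ema_update`, `train_toy_tract`) and all hyperparameters are illustrative. The target for the student at time t is built from one teacher step followed by a jump of the EMA self-teacher, which is the transitive-closure pattern described above.

```python
import random


def teacher_step(x, a=0.9):
    """Hypothetical one-step teacher: moves x from time t to time t-1."""
    return a * x


def ema_update(ema, weights, decay):
    """Exponential moving average of the student weights (the self-teacher)."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


def train_toy_tract(T=8, a=0.9, iters=5000, lr=0.3, decay=0.9, seed=0):
    """Train a toy student whose entry g[t] jumps from time t directly to 0."""
    rng = random.Random(seed)
    g = [1.0] * (T + 1)   # g[0] is the identity jump and is never trained
    g_ema = list(g)       # self-teacher weights
    for _ in range(iters):
        t = rng.randint(1, T)            # sample a time step in 1..T
        x_t = rng.uniform(-1.0, 1.0)     # toy "noisy sample" at time t
        x_prev = teacher_step(x_t, a)    # teacher: t -> t-1
        target = g_ema[t - 1] * x_prev   # self-teacher: t-1 -> 0
        # Gradient of the squared error (g[t] * x_t - target)^2 w.r.t. g[t].
        grad = 2.0 * (g[t] * x_t - target) * x_t
        g[t] -= lr * grad
        g_ema = ema_update(g_ema, g, decay)
    return g


if __name__ == "__main__":
    g = train_toy_tract()
    for t in (1, 4, 8):
        print(t, round(g[t], 4), round(0.9 ** t, 4))
```

In this toy setting, running the teacher T times from time T multiplies the input by `a ** T`, so a well-trained student entry `g[T]` should approach that value while needing only one evaluation. The real method distills a neural network and uses the paper's own targets and loss weighting, which this sketch does not reproduce.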

Experimental Results
Research Questions
- RQ1: Can TRACT achieve high-quality samples with 1–2 inference steps without architectural changes?
- RQ2: Does reducing the number of distillation phases mitigate objective degeneracy and enable effective SWA?
- RQ3: How does TRACT perform with VE/EDM teachers and alternative samplers (RK, DDIM-VE) across CIFAR-10 and 64×64 ImageNet?
Key Findings
| Method | NFE | FID | Parameters |
|---|---|---|---|
| TRACT-EDM-256M ∗ | 1 | 3.78 ± 0.01 | 56M |
| TRACT-96M ∗ | 1 | 4.17 ± 0.03 | 56M |
| TRACT-256M | 1 | 4.45 ± 0.05 | 60M |
| BTD-96M [44] | - | 9.12 | 60M |
| TRACT-96M | 2 | 3.32 ± 0.02 | 60M |
| TRACT-EDM-256M ∗ | 2 | 3.55 ± 0.01 | 56M |
| TRACT-EDM-96M ∗ | 2 | 3.75 ± 0.02 | 56M |
| BTD-96M [44] | - | 4.51 | 60M |
| TRACT-96M | 1 | 7.43 ± 0.07 | 296M |
| TRACT-EDM-96M ∗ | 1 | 7.52 ± 0.05 | 296M |
- On CIFAR-10, 1-step TRACT improves FID from 9.1 (BTD) to 4.5 with the same architecture and budget.
- On 64×64 ImageNet, 1-step TRACT achieves 7.4 FID (TRACT-96M; the EDM variant reaches 7.5), improving over BTD baselines.
- With 2 steps, TRACT-96M reaches 3.32 FID on CIFAR-10; with 1 step, it reaches 7.43 FID on 64×64 ImageNet.
- On CIFAR-10, TRACT-EDM-256M achieves 3.78±0.01 FID with 1 NFE, and TRACT-EDM-96M achieves 3.75±0.02 FID with 2 NFE (Table 1 and related text).
- On 64×64 ImageNet, TRACT-96M achieves 7.43±0.07 FID with 1 NFE and TRACT-EDM-96M achieves 7.52±0.05 with the same setup (Table 2 and related text).
- Ablations show best performance with a 2-phase schedule (1024→32→1) and EMA-based self-teaching; more phases degrade performance due to objective degeneracy.
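The phase counts mentioned above can be checked with a little arithmetic. The sketch below assumes, as stated in the method summary, that BTD needs log2(T) phases because each phase halves the step count, and that a TRACT schedule such as 1024→32→1 uses one phase per arrow. The helper names (`btd_phases`, `tract_phases`, `steps_per_group`) are hypothetical and only serve this illustration.

```python
import math


def btd_phases(num_steps):
    """Phases needed when every phase halves the step count (power of two only)."""
    if num_steps < 1 or num_steps & (num_steps - 1) != 0:
        raise ValueError("num_steps must be a power of two")
    return int(math.log2(num_steps))


def tract_phases(schedule):
    """One distillation phase per transition in the schedule."""
    return len(schedule) - 1


def steps_per_group(schedule):
    """How many teacher steps each student step replaces in every phase."""
    return [src // dst for src, dst in zip(schedule, schedule[1:])]


if __name__ == "__main__":
    schedule = [1024, 32, 1]
    print("BTD phases:", btd_phases(schedule[0]))
    print("TRACT phases:", tract_phases(schedule))
    print("Teacher steps per student step:", steps_per_group(schedule))
```

For a 1024-step teacher, halving gives 10 phases, while the 1024→32→1 schedule gives 2 phases in which each student step stands in for 32 teacher steps. Fewer phases means fewer generations of students trained on the outputs of earlier students, which is the setting the ablation bullet associates with less objective degeneracy.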

This summary was generated by AI and reviewed by human editors.