QUICK REVIEW

[Paper Review] TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation

David Berthelot, Arnaud Autef|arXiv (Cornell University)|Mar 7, 2023

Advanced Neuroimaging Techniques and Applications|Cited by 12

One-sentence summary

TRACT significantly improves single- and few-step diffusion sampling by distilling through a transitive closure approach, achieving state-of-the-art FID scores for 1-step DDIM on CIFAR-10 and 64×64 ImageNet without changing architecture.

ABSTRACT

Denoising Diffusion models have demonstrated their proficiency for generative sampling. However, generating good samples often requires many iterations. Consequently, techniques such as binary time-distillation (BTD) have been proposed to reduce the number of network calls for a fixed architecture. In this paper, we introduce TRAnsitive Closure Time-distillation (TRACT), a new method that extends BTD. For single step diffusion, TRACT improves FID by up to 2.4x on the same architecture, and achieves new single-step Denoising Diffusion Implicit Models (DDIM) state-of-the-art FID (7.4 for ImageNet64, 3.8 for CIFAR10). Finally we tease apart the method through extended ablations. The PyTorch implementation will be released soon.

Motivation and objectives

Reduce the inference cost of diffusion models by enabling single- or few-step sampling without architecture changes.
Identify limitations of binary time-distillation (BTD) such as objective degeneracy and SWA incompatibility.
Propose TRACT to distill outputs across time steps via transitive closure with self-teaching to maintain quality with few phases.
Show that TRACT achieves state-of-the-art or competitive FID with 1-2 steps on CIFAR-10 and 64×64 ImageNet and analyze ablations.

Proposed method

Extend binary time-distillation (BTD) to Transitive Closure Time-Distillation (TRACT), reducing the number of distillation phases from log2(T) to a small constant (1–2).
Train a student to distill the teacher’s inference from t to t' where t' < t using a self-teacher EMA to perform transitive closure (equations 6–9).
Use a self-teaching EMA of the student weights to generate targets for multi-step jumps (Algorithm 1).
Adapt TRACT to VE/EDM settings with RK and DDIM-VE teachers and derive corresponding targets and losses (equations 11–15).
Mitigate objective degeneracy by limiting distillation phases and leveraging self-teaching with EMA and inference-time EMA.
Provide training details including group-based distillation, loss weighting, and EMA updates (Appendix references).
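
The self-teaching idea in the list above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the scalar "teacher" dynamics, the per-timestep scalar student, and the names teacher_step, tract_target, and train_group are hypothetical and are not the paper's Algorithm 1. The sketch only shows how a target for a jump from t to the group start can be built from one teacher step followed by a jump made with an EMA copy of the student.

```python
import random

def teacher_step(x, t):
    """Toy stand-in (assumption) for one teacher sampler step from t to t - 1."""
    return x * t / (t + 1.0)

def tract_target(x_t, t, t_start, ema_w):
    """Build the target for jumping from t directly to t_start."""
    x_prev = teacher_step(x_t, t)        # teacher covers t -> t - 1
    if t - 1 == t_start:
        return x_prev                    # already at the start of the group
    return ema_w[t - 1] * x_prev         # self-teacher covers t - 1 -> t_start

def train_group(t_start, t_end, steps=4000, lr=0.05, ema_decay=0.9, seed=0):
    """Fit a toy student jump(x, t) = w[t] * x on one group of timesteps."""
    rng = random.Random(seed)
    w = [1.0] * (t_end + 1)              # student weights, one scalar per timestep
    ema_w = list(w)                      # self-teacher: EMA of the student weights
    for _ in range(steps):
        t = rng.randint(t_start + 1, t_end)   # sample a timestep inside the group
        x_t = rng.gauss(0.0, 1.0)
        target = tract_target(x_t, t, t_start, ema_w)
        # Gradient step on the squared error (w[t] * x_t - target) ** 2.
        w[t] -= lr * 2.0 * (w[t] * x_t - target) * x_t
        ema_w = [ema_decay * e + (1.0 - ema_decay) * s for e, s in zip(ema_w, w)]
    return w

weights = train_group(t_start=0, t_end=4)
```

In this toy setting the composition of teacher steps from t down to 0 multiplies x by 1 / (t + 1), so the learned weights can be compared against that closed form. The student never sees a multi-step teacher rollout; it only sees one teacher step plus its own averaged prediction.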

Figure 1 : Transitive Closure Distillation of a group $\{t_{i},\ldots,t_{j}\}$ .
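
To make the notion of a group concrete, the following sketch counts phases and partitions timesteps for the 1024→32→1 schedule mentioned in this review. The helper names btd_phases and tract_groups, the half-open indexing convention, and the assumption that T is a power of two divisible by the student step count are illustrative choices, not details taken from the paper.

```python
import math

def btd_phases(T):
    """Halving phases needed to go from T steps to 1 (assumes T is a power of two)."""
    return int(math.log2(T))

def tract_groups(T, student_steps):
    """Split T teacher timesteps into contiguous groups, one per student step.

    Each pair (t_i, t_j) means: from any t with t_i < t <= t_j, the student
    learns to jump directly to t_i (an indexing convention assumed here).
    """
    size = T // student_steps
    return [(k * size, (k + 1) * size) for k in range(student_steps)]

phase_1 = tract_groups(1024, 32)   # first phase: 1024 teacher steps -> 32 student steps
phase_2 = tract_groups(32, 1)      # second phase: 32 steps -> 1 step
print(btd_phases(1024), len(phase_1), phase_1[0], phase_2)
```

Under these assumptions, halving from 1024 steps takes ten phases, while the two-phase schedule uses 32 groups of 32 timesteps followed by a single group covering all 32 remaining steps.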

Experimental results

Research questions

RQ1: Can TRACT achieve high-quality samples with 1–2 inference steps without architectural changes?
RQ2: Does reducing the number of distillation phases mitigate objective degeneracy and enable effective SWA?
RQ3: How does TRACT perform with VE/EDM teachers and alternative samplers (RK, DDIM-VE) across CIFAR-10 and 64×64 ImageNet?

Key findings

Method	NFEs	FID	Parameters
TRACT-EDM-256M ∗	1	3.78 ± 0.01	56M
TRACT-96M ∗	1	4.17 ± 0.03	56M
TRACT-256M	1	4.45 ± 0.05	60M
BTD-96M [44]	-	9.12	60M
TRACT-96M	2	3.32 ± 0.02	60M
TRACT-EDM-256M ∗	2	3.55 ± 0.01	56M
TRACT-EDM-96M ∗	2	3.75 ± 0.02	56M
BTD-96M [44]	-	4.51	60M
TRACT-96M	1	7.43 ± 0.07	296M
TRACT-EDM-96M ∗	1	7.52 ± 0.05	296M

1-step TRACT improves CIFAR-10 FID from 9.1 (BTD) to 4.5 with the same architecture and budget.
1-step TRACT achieves 7.4 FID on 64×64 ImageNet, improving over BTD baselines.
TRACT-96M reaches 3.32 FID on CIFAR-10 with 2 sampling steps, and 7.43 FID on 64×64 ImageNet with 1 sampling step.
TRACT-EDM-256M achieves 3.78±0.01 FID with 1 NFE on CIFAR-10; TRACT-EDM-96M achieves 3.75±0.02 FID with 2 NFEs on CIFAR-10 (Table 1 and related text).
On 64×64 ImageNet, TRACT-96M achieves 7.43±0.07 FID with 1 NFE and TRACT-EDM-96M achieves 7.52±0.05 with the same setup (Table 2 and related text).
Ablations show best performance with a 2-phase schedule (1024→32→1) and EMA-based self-teaching; more phases degrade performance due to objective degeneracy.
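
As a quick arithmetic check on the 1-step CIFAR-10 numbers, the ratios below are computed from the FID values in the table. Which pair of rows the abstract's "up to 2.4x" refers to is an assumption here; the review does not say. The helper name improvement_factor is hypothetical.

```python
# 1-step CIFAR-10 FID values copied from the table in this review (lower is better).
fid_1step_cifar10 = {
    "BTD-96M": 9.12,
    "TRACT-256M": 4.45,
    "TRACT-EDM-256M": 3.78,
}

def improvement_factor(baseline_fid, new_fid):
    """Ratio of baseline FID to new FID; values above 1 mean the new model is better."""
    return baseline_fid / new_fid

baseline = fid_1step_cifar10["BTD-96M"]
factors = {
    name: improvement_factor(baseline, fid)
    for name, fid in fid_1step_cifar10.items()
    if name != "BTD-96M"
}
for name, factor in factors.items():
    print(f"{name}: {factor:.2f}x lower FID than BTD-96M")
```

The ratio against TRACT-EDM-256M rounds to 2.4, which is at least consistent with the abstract's figure, while the ratio against TRACT-256M is closer to 2.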

Figure 5 : 1-step FID for 2-phases $T:1024\to 32\to 1$ TRACT distilled models. Each curve maps to a different way to set the inference time EMA momentum $\mu$ across training lengths. Dashed lines correspond to fixing a $\mu$ value, solid lines correspond to fixing $\epsilon=\mu^{N}$ .
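
The relation ε = μ^N in the caption can be sketched as follows, assuming N is the number of training steps and ε is the weight left on the initial parameters after N EMA updates. Both readings are assumptions based on the caption alone, and the names momentum_for_budget and ema_update are hypothetical.

```python
def momentum_for_budget(eps, num_steps):
    """Pick mu so that mu ** num_steps == eps (the fixed-epsilon rule)."""
    return eps ** (1.0 / num_steps)

def ema_update(avg, new, mu):
    """One exponential moving average update with momentum mu."""
    return mu * avg + (1.0 - mu) * new

# With eps fixed, longer training runs get a momentum closer to 1.
for n in (1_000, 10_000, 100_000):
    print(n, momentum_for_budget(1e-3, n))

# Residual weight of the initial value after N updates equals eps.
num_steps = 1_000
mu = momentum_for_budget(1e-3, num_steps)
residual = 1.0
for _ in range(num_steps):
    residual = ema_update(residual, 0.0, mu)
```

Fixing μ instead makes the residual weight of early checkpoints depend on the training length, which is one plausible reason the two families of curves in the figure behave differently.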

This review was generated by AI and checked by a human editor.