QUICK REVIEW

[Paper Review] TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation

David Berthelot, Arnaud Autef|arXiv (Cornell University)|Mar 7, 2023
Advanced Neuroimaging Techniques and Applications | Cited by 12
One-line summary

TRACT improves single-step FID by up to 2.4x over binary time-distillation on the same architecture, reaching state-of-the-art 1-step DDIM FID on CIFAR-10 (3.8) and 64×64 ImageNet (7.4) without changing the architecture.

ABSTRACT

Denoising Diffusion models have demonstrated their proficiency for generative sampling. However, generating good samples often requires many iterations. Consequently, techniques such as binary time-distillation (BTD) have been proposed to reduce the number of network calls for a fixed architecture. In this paper, we introduce TRAnsitive Closure Time-distillation (TRACT), a new method that extends BTD. For single step diffusion, TRACT improves FID by up to 2.4x on the same architecture, and achieves new single-step Denoising Diffusion Implicit Models (DDIM) state-of-the-art FID (7.4 for ImageNet64, 3.8 for CIFAR10). Finally we tease apart the method through extended ablations. The PyTorch implementation will be released soon.

Motivation and objectives

  • Reduce the inference cost of diffusion models by enabling single- or few-step sampling without architecture changes.
  • Identify limitations of binary time-distillation (BTD) such as objective degeneracy and SWA incompatibility.
  • Propose TRACT to distill outputs across time steps via transitive closure with self-teaching to maintain quality with few phases.
  • Show that TRACT achieves state-of-the-art or competitive FID with 1-2 steps on CIFAR-10 and 64×64 ImageNet and analyze ablations.

Proposed method

  • Extend binary time-distillation (BTD) to Transitive Closure Time-Distillation (TRACT) reducing distillation phases from log2(T) to a small constant (1–2).
  • Train a student to distill the teacher’s inference from t to t' where t' < t using a self-teacher EMA to perform transitive closure (equations 6–9).
  • Use a self-teaching EMA of the student weights to generate targets for multi-step jumps (Algorithm 1).
  • Adapt TRACT to VE/EDM settings with RK and DDIM-VE teachers and derive corresponding targets and losses (equations 11–15).
  • Mitigate objective degeneracy by limiting distillation phases and leveraging self-teaching with EMA and inference-time EMA.
  • Provide training details including group-based distillation, loss weighting, and EMA updates (Appendix references).
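The target construction described in the list above can be sketched as follows. This is a minimal pure-Python toy, not the authors' implementation: `tract_target`, `ema_update`, `toy_teacher_step`, and `toy_exact_self_teacher` are hypothetical names, the teacher is a made-up linear map rather than a real DDIM step, samples are plain floats, and consecutive integer timesteps are assumed.

```python
import math


def ema_update(ema_params, params, momentum):
    """Move the self-teacher weights toward the current student weights."""
    return {k: momentum * ema_params[k] + (1.0 - momentum) * params[k]
            for k in params}


def tract_target(x_t, t, t_i, teacher_step, self_teacher):
    """Build the regression target for the student at time t (t > t_i).

    teacher_step(x, t, s): one teacher sampler step from time t to time s.
    self_teacher(x, t, s): EMA copy of the student, jumping from t to s.
    """
    t_prev = t - 1
    x_prev = teacher_step(x_t, t, t_prev)      # one teacher step: t -> t-1
    if t_prev == t_i:
        return x_prev                          # end of the group reached
    return self_teacher(x_prev, t_prev, t_i)   # self-teacher covers t-1 -> t_i


# Toy stand-ins (assumptions for illustration only).
def toy_teacher_step(x, t, s):
    return x * (s + 1) / (t + 1)


def toy_exact_self_teacher(x, t, s):
    # A self-teacher that has perfectly learned the composed teacher steps.
    return x * (s + 1) / (t + 1)


target = tract_target(1.0, t=5, t_i=2,
                      teacher_step=toy_teacher_step,
                      self_teacher=toy_exact_self_teacher)
```

In this toy setting, an exact self-teacher makes the target equal to the composition of every teacher step from `t` down to `t_i`, which is the quantity the student is asked to reproduce in a single call. In real training the self-teacher is only an approximation that improves as `ema_update` tracks the student.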
Figure 1 : Transitive Closure Distillation of a group $\{t_{i},\ldots,t_{j}\}$ .

Experimental results

Research questions

  • RQ1: Can TRACT achieve high-quality samples with 1–2 inference steps without architectural changes?
  • RQ2: Does reducing the number of distillation phases mitigate objective degeneracy and enable effective SWA?
  • RQ3: How does TRACT perform with VE/EDM teachers and alternative samplers (RK, DDIM-VE) across CIFAR-10 and 64×64 ImageNet?

Main findings

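| Method | NFEs | FID | Parameters |
|---|---|---|---|
| TRACT-EDM-256M ∗ | 1 | 3.78 ± 0.01 | 56M |
| TRACT-96M ∗ | 1 | 4.17 ± 0.03 | 56M |
| TRACT-256M | 1 | 4.45 ± 0.05 | 60M |
| BTD-96M [44] | - | 9.12 | 60M |
| TRACT-96M | 2 | 3.32 ± 0.02 | 60M |
| TRACT-EDM-256M ∗ | 2 | 3.55 ± 0.01 | 56M |
| TRACT-EDM-96M ∗ | 2 | 3.75 ± 0.02 | 56M |
| BTD-96M [44] | - | 4.51 | 60M |
| TRACT-96M | 1 | 7.43 ± 0.07 | 296M |
| TRACT-EDM-96M ∗ | 1 | 7.52 ± 0.05 | 296M |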
  • With the same architecture (60M parameters), 1-step TRACT improves CIFAR-10 FID from 9.12 (BTD-96M) to 4.45 (TRACT-256M).
  • 1-step TRACT achieves 7.4 FID on 64×64 ImageNet, improving over BTD baselines.
  • With 2 sampling steps, TRACT-96M reaches 3.32 ± 0.02 FID on CIFAR-10.
  • TRACT-EDM-256M achieves 3.78 ± 0.01 FID with 1 NFE on CIFAR-10; TRACT-EDM-96M achieves 3.75 ± 0.02 FID with 2 NFEs on CIFAR-10 (Table 1 and related text).
  • On 64×64 ImageNet, TRACT-96M achieves 7.43 ± 0.07 FID with 1 NFE and TRACT-EDM-96M achieves 7.52 ± 0.05 with the same setup (Table 2 and related text).
  • Ablations show best performance with a 2-phase schedule (1024→32→1) and EMA-based self-teaching; more phases degrade performance due to objective degeneracy.
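To make the phase counts concrete, the sketch below compares the log2(T) phases attributed to BTD with the 2-phase 1024→32→1 schedule. The function names are hypothetical, and the grouping rule (each student step covers T_in / T_out consecutive teacher steps) is an assumption made for illustration rather than a detail taken from the paper.

```python
def btd_num_phases(T):
    """Phases needed when each phase halves the step count: log2(T)."""
    if T < 1 or T & (T - 1):
        raise ValueError("T must be a power of two")
    return T.bit_length() - 1


def tract_groups(schedule):
    """For each phase T_in -> T_out, return (number of groups, steps per group)."""
    phases = []
    for t_in, t_out in zip(schedule, schedule[1:]):
        if t_in % t_out != 0:
            raise ValueError("each step count must divide the previous one")
        phases.append((t_out, t_in // t_out))
    return phases


print(btd_num_phases(1024))          # 10
print(tract_groups([1024, 32, 1]))   # [(32, 32), (1, 32)]
```

Under these assumptions, starting from T = 1024 a halving schedule needs 10 phases, while the 2-phase schedule first distills 32 groups of 32 steps each and then a single group of 32 steps.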
Figure 5 : 1-step FID for 2-phases $T:1024\to 32\to 1$ TRACT distilled models. Each curve maps to a different way to set the inference time EMA momentum $\mu$ across training lengths. Dashed lines correspond to fixing a $\mu$ value, solid lines correspond to fixing $\epsilon=\mu^{N}$ .
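The relation $\epsilon=\mu^{N}$ in the caption can be inverted to get the momentum for a given training length. The sketch below assumes that $N$ is the number of training steps (the caption does not define it here) and uses a hypothetical helper name.

```python
def momentum_for_length(eps, num_steps):
    """Solve eps = mu ** num_steps for mu, with 0 < eps < 1."""
    if not 0.0 < eps < 1.0 or num_steps < 1:
        raise ValueError("need 0 < eps < 1 and num_steps >= 1")
    return eps ** (1.0 / num_steps)


# Holding eps fixed, longer training runs get a momentum closer to 1.
for n in (1_000, 10_000, 100_000):
    print(n, momentum_for_length(1e-3, n))
```

Fixing $\epsilon$ instead of $\mu$ keeps the weight given to the initial parameters constant across training lengths, whereas a fixed $\mu$ forgets them faster in relative terms as runs get longer.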

This review was generated by AI and checked by a human editor.