[Paper Review] Accelerating Diffusion Decoders via Multi-Scale Sampling and One-Step Distillation
Presents a two-stage acceleration framework for diffusion-based image tokenizers: (1) multi-scale coarse-to-fine sampling, with a theoretical $\mathcal{O}(\log n)$ speedup over full-resolution sampling, and (2) per-scale distillation to a single-step denoiser, yielding large speedups with comparable fidelity.
Image tokenization plays a central role in modern generative modeling by mapping visual inputs into compact representations that serve as an intermediate signal between pixels and generative models. Diffusion-based decoders have recently been adopted in image tokenization to reconstruct images from latent representations with high perceptual fidelity. In contrast to diffusion models used for downstream generation, these decoders are dedicated to faithful reconstruction rather than content generation. However, their iterative sampling process introduces significant latency, making them impractical for real-time or large-scale applications. In this work, we introduce a two-stage acceleration framework to address this inefficiency. First, we propose a multi-scale sampling strategy, where decoding begins at a coarse resolution and progressively refines the output by doubling the resolution at each stage, achieving a theoretical speedup of $\mathcal{O}(\log n)$ compared to standard full-resolution sampling. Second, we distill the diffusion decoder at each scale into a single-step denoising model, enabling fast and high-quality reconstructions in a single forward pass per scale. Together, these techniques yield an order-of-magnitude reduction in decoding time with little degradation in output quality. Our approach provides a practical pathway toward efficient yet expressive image tokenizers. We hope it serves as a foundation for future work in efficient visual tokenization and downstream generation.
Motivation & Objective
- Motivate diffusion decoders for image tokenization and address slow inference.
- Propose a multi-scale, coarse-to-fine decoding scheme to reduce computation.
- Introduce per-scale distillation to single-step denoisers to cut latency.
- Demonstrate competitive reconstruction quality with major speedups on ImageNet-1K.
Proposed method
- Use a multi-scale diffusion decoder (MMDiT-based encoder–decoder) that denoises across S scales from low to high resolution (32x32 up to 256x256).
- Train with a velocity-field denoising objective; sampling uses classifier-free guidance and performs Euler updates per timestep at each scale.
- Stage-1 training jointly learns encoder and decoder; Stage-2 distills the decoder at each scale into a one-step denoiser per scale, conditioned on the same latent code.
- Distillation uses a frozen teacher, a student decoder, and a discriminator; losses include multi-scale reconstruction, perceptual (LPIPS), and adversarial terms.
- Stage-2 distillation reduces steps from 50–100 to 4 total (one per scale).
- Evaluation on ImageNet-1K (256x256) comparing against other tokenizers using rFID, PSNR, SSIM, and throughput.
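The coarse-to-fine sampling loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_velocity` is a hypothetical stand-in for the MMDiT decoder, the time convention (t running from 0 at noise to 1 at data), the nearest-neighbour upsampling between scales, and the guidance formula are all assumed, and the review does not say how noise is re-injected at each new scale.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) array."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def toy_velocity(x, t, latent, cond=True):
    """Hypothetical stand-in for the decoder's velocity prediction.

    Points from the current sample toward a target; the unconditional
    branch (cond=False) ignores the latent and targets zero.
    """
    target = latent if cond else 0.0
    return target - x

def guided_velocity(x, t, latent, w):
    """Classifier-free guidance: blend unconditional and conditional velocities."""
    v_cond = toy_velocity(x, t, latent, cond=True)
    v_uncond = toy_velocity(x, t, latent, cond=False)
    return v_uncond + w * (v_cond - v_uncond)

def multiscale_sample(latent, resolutions=(32, 64, 128, 256),
                      steps_per_scale=1, guidance=1.0, seed=0):
    """Denoise at the coarsest scale first, then upsample and refine."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((resolutions[0], resolutions[0], 3))
    steps = 0
    for i, _res in enumerate(resolutions):
        if i > 0:
            x = upsample2x(x)  # double the resolution for the next scale
        dt = 1.0 / steps_per_scale
        for k in range(steps_per_scale):
            t = k * dt
            x = x + dt * guided_velocity(x, t, latent, guidance)  # Euler update
            steps += 1
    return x, steps

image, total_steps = multiscale_sample(latent=0.5)
print(image.shape, total_steps)
```

With `steps_per_scale=1` the loop takes four sampler steps in total, one per scale, which mirrors the step count reported for the distilled student; a multi-step teacher would correspond to a larger `steps_per_scale`.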
Experimental results
Research questions
- RQ1: Can diffusion tokenizers achieve real-time or near-real-time reconstruction without sacrificing perceptual fidelity?
- RQ2: Does a coarse-to-fine multi-scale decoding reduce computational cost while preserving quality?
- RQ3: Can per-scale distillation convert multi-step diffusion into one-step denoisers per scale without large quality loss?
- RQ4: What are the trade-offs between reconstruction fidelity and decoding speed across scales?
- RQ5: How does the proposed method compare to existing diffusion tokenizers and non-diffusion tokenizers in terms of rFID, PSNR, SSIM, and throughput?
Key findings
Table 1: Tokenization comparison on ImageNet-1K at 256x256 resolution.
| Model | Num Tokens | rFID↓ | PSNR↑ | SSIM↑ | Throughput (img/s)↑ |
|---|---|---|---|---|---|
| Ours (After 1st stage) | 128 | 0.91 | 23.27 | 0.752 | 2.76 |
| Ours (After 2nd stage) | 128 | 1.09 | 24.74 | 0.800 | 87.16 |
| FlowMo | 256 | 0.95 | 22.07 | 0.649 | 1.44 |
| DiTo | 256 | 0.78 | 24.10 | 0.706 | 0.19 |
| TiTok-S-128 | 128 | 1.71 | 17.52 | 0.437 | 7.31 |
| LlamaGen-16 | 256 | 2.19 | 20.67 | 0.589 | 4.55 |
| Cosmos DI-16x16 | 256 | 4.40 | 19.98 | 0.536 | 9.55 |
- The multi-scale sampler achieves up to 10x speedup over full-resolution sampling, with a theoretical speedup of O(log n).
- Per-scale distillation reduces total denoising steps to 4 (one per scale) and yields over 30x faster decoding than the teacher.
- The distilled, multi-scale decoder achieves rFID 1.09, PSNR 24.74, SSIM 0.800, and throughput 87.16 img/s on ImageNet-1K at 256x256.
- Compared to diffusion tokenizers DiTo and FlowMo, the distilled multi-scale model delivers up to 459x and 60x speedups respectively with competitive quality.
- Table 1 shows that the final-stage distilled model approaches the fidelity of diffusion tokenizers while surpassing many non-diffusion tokenizers in throughput.
- Table 2 demonstrates substantial throughput gains for the distilled students over the teacher, with modest rFID increases (e.g., a 0.18 increase for the four-scale distilled model).
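The speedup figures quoted above can be checked directly against the throughput column of Table 1. The snippet below is only that arithmetic; the dictionary keys are labels chosen here for illustration.

```python
# Throughput (img/s) and rFID values copied from Table 1.
throughput = {
    "ours_stage1": 2.76,   # teacher, after 1st stage
    "ours_stage2": 87.16,  # distilled student, after 2nd stage
    "FlowMo": 1.44,
    "DiTo": 0.19,
}
rfid = {"ours_stage1": 0.91, "ours_stage2": 1.09}

def speedup(fast, slow):
    """Ratio of throughputs: how many times faster `fast` is than `slow`."""
    return throughput[fast] / throughput[slow]

print(f"vs DiTo:    {speedup('ours_stage2', 'DiTo'):.1f}x")         # 458.7x
print(f"vs FlowMo:  {speedup('ours_stage2', 'FlowMo'):.1f}x")       # 60.5x
print(f"vs teacher: {speedup('ours_stage2', 'ours_stage1'):.1f}x")  # 31.6x
print(f"rFID increase: {rfid['ours_stage2'] - rfid['ours_stage1']:.2f}")  # 0.18
```

These ratios are consistent with the reported "up to 459x" (DiTo), "60x" (FlowMo), and "over 30x" (teacher) figures, and with the 0.18 rFID increase after distillation.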
This review was created by AI and reviewed by human editors.