QUICK REVIEW

[Paper Digest] Accelerating Diffusion Decoders via Multi-Scale Sampling and One-Step Distillation

Chuhan Wang, Hao Chen|arXiv (Cornell University)|Mar 20, 2026

Generative Adversarial Networks and Image Synthesis | Cited by 0

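One-Sentence Summary

The paper presents a two-stage acceleration framework for diffusion-based image tokenizers: (1) multi-scale, coarse-to-fine sampling with a theoretical O(log n) speedup over full-resolution sampling, and (2) per-scale distillation into a single-step denoiser, which together yield a large speedup at similar fidelity.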

ABSTRACT

Image tokenization plays a central role in modern generative modeling by mapping visual inputs into compact representations that serve as an intermediate signal between pixels and generative models. Diffusion-based decoders have recently been adopted in image tokenization to reconstruct images from latent representations with high perceptual fidelity. In contrast to diffusion models used for downstream generation, these decoders are dedicated to faithful reconstruction rather than content generation. However, their iterative sampling process introduces significant latency, making them impractical for real-time or large-scale applications. In this work, we introduce a two-stage acceleration framework to address this inefficiency. First, we propose a multi-scale sampling strategy, where decoding begins at a coarse resolution and progressively refines the output by doubling the resolution at each stage, achieving a theoretical speedup of $\mathcal{O}(\log n)$ compared to standard full-resolution sampling. Second, we distill the diffusion decoder at each scale into a single-step denoising model, enabling fast and high-quality reconstructions in a single forward pass per scale. Together, these techniques yield an order-of-magnitude reduction in decoding time with little degradation in output quality. Our approach provides a practical pathway toward efficient yet expressive image tokenizers. We hope it serves as a foundation for future work in efficient visual tokenization and downstream generation.

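Research Motivation and Goals

Advance research on diffusion decoders for image tokenization and address their slow inference.
Propose a multi-scale, coarse-to-fine decoding scheme to reduce computation.
Introduce per-scale distillation into a single-step denoiser to reduce latency.
Demonstrate competitive reconstruction quality and substantial speedups on ImageNet-1K.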

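Proposed Method

Uses a multi-scale diffusion decoder (an MMDiT-based encoder-decoder) that denoises across S scales, from low to high resolution (32x32 to 256x256).
Trains with a velocity-field denoising objective under classifier-free guidance, applying an Euler update at every timestep of every scale.
Stage 1 trains the encoder and decoder jointly; Stage 2 distills the decoder at each scale into a single-step denoiser conditioned on the same latent code.
Distillation uses a frozen teacher model, a student decoder, and a discriminator; the losses include multi-scale reconstruction, perceptual (LPIPS), and adversarial terms, among others.
Stage 2 distillation reduces the number of sampling steps from 50-100 to 4 in total (one per scale).
Evaluated on ImageNet-1K (256x256) against other tokenizers using rFID, PSNR, SSIM, and throughput.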
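
The coarse-to-fine Euler sampling described in the method points above can be illustrated with a small toy sketch. It is not the authors' implementation: `toy_velocity` is a hypothetical per-pixel stand-in for the MMDiT decoder, the scalar `latent` stands in for the latent code, and the way a finished scale is handed to the next one (nearest-neighbour upsampling blended with fresh noise via the made-up `renoise` parameter) is an assumption, since the digest does not specify that transition. Only the scale list (32 to 256, doubling each time), the Euler update, and the classifier-free guidance combination come from the text.

```python
import random

SCALES = (32, 64, 128, 256)  # resolutions named in the method summary


def upsample2x(img):
    """Nearest-neighbour 2x upsampling of a 2D list-of-lists image."""
    out = []
    for row in img:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def toy_velocity(x, t, latent, guidance_scale=1.0):
    """Hypothetical per-pixel stand-in for the decoder network.

    The conditional branch flows towards the value `latent`, the
    unconditional branch towards zero; the two are combined with the
    standard classifier-free guidance formula.
    """
    v_cond = (latent - x) / (1.0 - t)
    v_uncond = (0.0 - x) / (1.0 - t)
    return v_uncond + guidance_scale * (v_cond - v_uncond)


def multiscale_decode(latent, steps_per_scale=1, renoise=0.5, seed=0):
    """Coarse-to-fine Euler sampling; returns (image, network_evaluations)."""
    rng = random.Random(seed)
    res0 = SCALES[0]
    # Start from pure Gaussian noise at the coarsest resolution.
    x = [[rng.gauss(0.0, 1.0) for _ in range(res0)] for _ in range(res0)]
    t_start, evals = 0.0, 0
    for i, res in enumerate(SCALES):
        if i > 0:
            # ASSUMPTION: upsample the previous output and blend in fresh
            # noise before refining; the source does not describe this step.
            x = [[(1.0 - renoise) * p + renoise * rng.gauss(0.0, 1.0)
                  for p in row] for row in upsample2x(x)]
            t_start = 1.0 - renoise
        assert len(x) == res  # resolution doubles at every scale
        dt = (1.0 - t_start) / steps_per_scale
        for k in range(steps_per_scale):
            t = t_start + k * dt
            # Euler update along the (guided) velocity field.
            x = [[p + dt * toy_velocity(p, t, latent) for p in row]
                 for row in x]
            evals += 1
    return x, evals


image, evals = multiscale_decode(latent=0.25, steps_per_scale=1)
print(len(image), len(image[0]), evals)  # 256 256 4
```

With `steps_per_scale=1` the loop performs four evaluations in total, which mirrors the "one step per scale" budget of the distilled model; setting it to 25 gives the 100-evaluation regime at the upper end of the 50-100 step range quoted for the teacher.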

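Experimental Results

Research Questions

RQ1: Can diffusion tokenizers achieve real-time or near-real-time reconstruction without sacrificing perceptual fidelity?
RQ2: Does coarse-to-fine multi-scale decoding reduce computational cost while preserving quality?
RQ3: Can per-scale distillation turn multi-step diffusion into single-step denoising at each scale without a noticeable loss in quality?
RQ4: How do reconstruction fidelity and decoding speed trade off across scales?
RQ5: How does the approach compare with existing diffusion and non-diffusion tokenizers in rFID, PSNR, SSIM, and throughput?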

Key Findings

Model	Tokens	rFID↓	PSNR↑	SSIM↑	Throughput (img/s)↑
Ours (after Stage 1)	128	0.91	23.27	0.752	2.76
Ours (after Stage 2)	128	1.09	24.74	0.800	87.16
FlowMo	256	0.95	22.07	0.649	1.44
DiTo	256	0.78	24.10	0.706	0.19
TiTok-S-128	128	1.71	17.52	0.437	7.31
LlamaGen-16	256	2.19	20.67	0.589	4.55
Cosmos DI-16x16	256	4.40	19.98	0.536	9.55
Table 1: Tokenizer comparison on ImageNet-1K at 256x256 resolution.

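The multi-scale sampler achieves up to about a 10x speedup over full-resolution sampling; the abstract characterizes the theoretical speedup as O(log n).
Per-scale distillation reduces the total number of denoising steps to about 4 (one per scale), making decoding more than 30x faster than with the teacher model.
On ImageNet-1K (256x256), the distilled multi-scale decoder reaches an rFID of about 1.09, a PSNR of 24.74, an SSIM of 0.80, and a throughput of 87.16 img/s.
Compared with the diffusion tokenizers DiTo and FlowMo, the distilled multi-scale model is up to about 459x and 60x faster, respectively, while maintaining competitive quality.
Table 1 shows that the final distilled model approaches the diffusion tokenizers in fidelity while exceeding many non-diffusion tokenizers in throughput.
Table 2 shows a substantial throughput gain from teacher to distilled student, with a slight increase in rFID (for example, +0.18 for the four-scale distilled model).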
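
The speedup figures quoted in the findings can be checked directly against the throughput column of Table 1. The short script below is only a sanity check on the reported numbers; the dictionary keys and the `speedup` helper are names introduced here for illustration.

```python
# Throughput in img/s, copied from Table 1.
THROUGHPUT = {
    "Ours (after Stage 1)": 2.76,
    "Ours (after Stage 2)": 87.16,
    "FlowMo": 1.44,
    "DiTo": 0.19,
}

# rFID before and after distillation, also from Table 1.
RFID_STAGE1, RFID_STAGE2 = 0.91, 1.09


def speedup(fast, slow):
    """Ratio of throughputs: how many times faster `fast` is than `slow`."""
    return THROUGHPUT[fast] / THROUGHPUT[slow]


for baseline in ("Ours (after Stage 1)", "FlowMo", "DiTo"):
    ratio = speedup("Ours (after Stage 2)", baseline)
    print(f"vs {baseline}: {ratio:.1f}x")
# vs Ours (after Stage 1): 31.6x
# vs FlowMo: 60.5x
# vs DiTo: 458.7x

print(f"rFID increase after distillation: {RFID_STAGE2 - RFID_STAGE1:.2f}")
# rFID increase after distillation: 0.18
```

The ratios agree with the "more than 30x", "about 60x", and "about 459x" claims, and the rFID gap matches the 0.18 increase attributed to the four-scale distilled model.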


This digest was generated by AI and reviewed by a human editor.