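[Paper Explained] DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention
tldr: DyLLM accelerates diffusion LLM inference by fully recomputing only the salient tokens and applying a saliency-aware approximate attention, reaching up to 9.6x higher throughput on LLaDA and Dream with accuracy close to the baseline.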
Masked Diffusion Language Models (MDLMs) enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressive generation. However, their iterative denoising process remains computationally expensive because it repeatedly processes the entire sequence at every step. We observe that across these diffusion steps, most token representations remain stable; only a small subset, which we term salient tokens, contributes meaningfully to the next update. Leveraging this temporal sparsity, we present DyLLM, a training-free inference framework that accelerates decoding by selectively computing only these salient tokens. DyLLM identifies saliency by measuring the cosine similarity of attention contexts between adjacent denoising steps. It recomputes feed-forward and attention operations only for salient tokens while reusing cached activations for the remainder. Across diverse reasoning and code-generation benchmarks, DyLLM achieves up to 9.6x higher throughput while largely preserving the baseline accuracy of state-of-the-art models like LLaDA and Dream.
Motivation and Objectives
- Motivate and address the inefficiency of MDLM inference caused by repeated full-sequence processing across diffusion steps.
- Leverage layer-wise temporal sparsity to identify the salient tokens that require updates, while reusing cached computations for stable tokens.
- Develop training-free inference techniques that maintain accuracy while significantly improving throughput.
- Propose a saliency-aware attention mechanism to further reduce attention overhead without sacrificing fidelity.
Proposed Method
- Define layer-wise temporal sparsity via temporal cosine similarity of attention contexts across adjacent diffusion steps.
- Identify salient tokens at each layer where similarity falls below a threshold, recomputing FFN and attention only for these tokens.
- Reuse cached activations for non-salient tokens and apply a saliency-aware approximate attention to reduce quadratic attention cost.
- Propagate semantic deltas across layers using a two-path update: exact updates for salient tokens and approximate updates for non-salient tokens, reducing computation from O(N^2d) to O(N|A|d).
- Adopt response-focused saliency, prioritizing salient tokens among response tokens, and feed the full sequence as input only at fixed intervals.
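The selection-and-reuse idea in the bullets above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`select_salient`, `selective_update`, `toy_ffn`), the toy vectors, and the use of a ReLU in place of a real feed-forward block are all assumptions made for illustration, and the saliency-aware approximate attention is not modeled here.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat degenerate vectors as changed
    return dot / (norm_u * norm_v)

def select_salient(prev_ctx, curr_ctx, tau=0.99):
    """Indices of tokens whose attention context moved between adjacent steps."""
    return [i for i, (p, c) in enumerate(zip(prev_ctx, curr_ctx))
            if cosine_similarity(p, c) < tau]

def selective_update(cache, curr_ctx, salient, ffn):
    """Recompute ffn only for salient tokens; reuse cached outputs otherwise."""
    out = list(cache)
    for i in salient:
        out[i] = ffn(curr_ctx[i])
    return out

def toy_ffn(x):
    # Hypothetical stand-in for a transformer FFN: elementwise ReLU.
    return [max(0.0, v) for v in x]

# Toy attention contexts for three tokens at two adjacent denoising steps.
prev_ctx = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
curr_ctx = [[1.0, 0.001], [1.0, 0.2], [1.0, 1.0]]

cache = [toy_ffn(x) for x in prev_ctx]          # activations from the previous step
salient = select_salient(prev_ctx, curr_ctx)    # only token 1 changed noticeably
updated = selective_update(cache, curr_ctx, salient, toy_ffn)
```

In this toy run only the second token falls below the threshold, so it is the only one passed through `toy_ffn` again; the other two keep their cached (slightly stale) outputs, which is the approximation the method accepts in exchange for less computation.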
Experimental Results
Research Questions
- RQ1: Can layer-wise temporal sparsity be exploited to accelerate diffusion LLM inference without large accuracy degradation?
- RQ2: How can salient token detection based on attention context similarity be integrated into FFN and attention updates to reduce compute?
- RQ3: What is the impact of saliency-aware approximate attention on overall generation quality and throughput across MDLMs?
- RQ4: How does the approach scale with higher degrees of parallel decoding (n_u) and different model families (LLaDA, Dream)?
Key Findings
- DyLLM achieves substantial throughput gains by recomputing only salient tokens across diffusion steps.
- Saliency-aware approximate attention maintains high fidelity while significantly reducing attention cost.
- Across LLaDA and Dream models, DyLLM delivers up to 7.6x and 9.6x throughput improvements respectively, with near-baseline accuracy on diverse benchmarks.
- The method scales robustly with increased parallel decoding (n_u) without requiring dataset- or model-specific tuning.
- Threshold choices (tau) enable controlled trade-offs between throughput and accuracy, with model-specific sweet spots (e.g., tau around 0.99–0.995).
- Salient tokens concentrate largely among response tokens, enabling efficient response-only refinement.
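To make the threshold trade-off concrete, the sketch below shows how raising tau marks more tokens as salient, which means more recomputation (lower throughput) but fewer stale activations. The similarity values are made up for illustration and are not taken from the paper; `salient_fraction` is a hypothetical helper name.

```python
def salient_fraction(similarities, tau):
    """Fraction of tokens whose similarity is below tau (i.e., recomputed)."""
    if not similarities:
        return 0.0
    return sum(1 for s in similarities if s < tau) / len(similarities)

# Made-up per-token cosine similarities between adjacent steps (illustrative only).
toy_similarities = [0.9999, 0.998, 0.996, 0.993, 0.97, 0.90, 0.9995, 0.9991]

fractions = {tau: salient_fraction(toy_similarities, tau)
             for tau in (0.95, 0.99, 0.995, 0.999)}
for tau, frac in fractions.items():
    print(f"tau={tau}: {frac:.3f} of tokens recomputed")
```

The fraction of recomputed tokens can only grow as tau increases, so tau acts as a single knob between speed and fidelity.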
This explainer was generated by AI and reviewed by a human editor.