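[Paper Explained] Diffusion Language Models Are Natively Length-Aware
This paper proposes SmartCrop, a zero-shot method that crops the diffusion canvas before generation begins. It extracts a length signal from the initial EoS logits and substantially reduces FLOPs, with no statistically significant performance loss on any of the four benchmarks and significant gains on two of them.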
Unlike autoregressive language models, which terminate variable-length generation upon predicting an End-of-Sequence (EoS) token, Diffusion Language Models (DLMs) operate over a fixed maximum-length context window for a predetermined number of denoising steps. However, this process is independent of the required response length, resulting in computational waste for the majority of short responses common in reasoning and chat tasks. To address this problem, we conjecture that the latent prompt representation contains sufficient information to estimate the required output length. We provide empirical evidence for this phenomenon and propose a zero-shot mechanism to dynamically crop the context window before generation begins, leading to fewer diffusion steps and substantial computational savings. We evaluate our approach on four benchmarks with diverse tasks -- GSM8K (reasoning), HumanEval (code generation), IfEval (instruction following), and LongFormQA (question answering) -- revealing massive efficiency gains at minimal performance impact. We report significant reductions in FLOPs across all tasks, with no statistically significant performance degradation, and significant performance improvements in 2 out of 4 tasks.
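Motivation and Objectives
- Motivate the need to address the inference waste that fixed-length canvases and EoS padding cause in diffusion language models.
- Propose a zero-shot, model-native mechanism that predicts output length from the latent prompt representation.
- Show that dynamic canvas cropping reduces compute (FLOPs) with little or no loss in task performance.
- Evaluate the method on diverse benchmarks (GSM8K, HumanEval, IfEval, LongFormQA) using an 8B-parameter diffusion language model (LLaDA).
- Demonstrate robustness and analyze sensitivity to length prediction and padding.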
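Proposed Method
- Formulate length prediction as estimating, from the EoS logits, the cumulative termination probability along the canvas.
- Define a threshold-based cropping rule: crop the canvas at the position where the cumulative termination probability exceeds $\tau$ (e.g., 0.9).
- Apply SmartCrop as a retraining-free, architecture-agnostic preprocessing step that runs before standard diffusion denoising, which then proceeds on the shortened canvas.
- Evaluate the fixed canvas against the cropped canvas on LLaDA across the four benchmarks, reporting FLOPs savings and task-specific metrics.
- Run a sensitivity analysis that perturbs the cropped length and compares against a random-length baseline, to verify that the length prediction is instance-specific.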
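
The threshold rule above can be illustrated with a minimal sketch. The summary does not give the exact formula for the cumulative termination probability, so this example assumes a hazard-style form, $F(k) = 1 - \prod_{i \le k} (1 - p_i)$, where $p_i$ is the softmax probability of EoS at canvas position $i$ in the initial forward pass. The function names `eos_probabilities` and `crop_length` are hypothetical and are not taken from the paper.

```python
import math


def eos_probabilities(logits_per_position, eos_id):
    """Softmax each position's logits and return P(EoS) for every position."""
    probs = []
    for logits in logits_per_position:
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        probs.append(exps[eos_id] / sum(exps))
    return probs


def crop_length(eos_probs, tau=0.9):
    """Return the smallest length k whose cumulative termination
    probability exceeds tau.

    ASSUMPTION (not from the paper): F(k) = 1 - prod_{i<=k} (1 - p_i).
    Falls back to the full canvas if the threshold is never exceeded.
    """
    survive = 1.0  # probability that no EoS has occurred up to position k
    for k, p in enumerate(eos_probs, start=1):
        survive *= 1.0 - p
        if 1.0 - survive > tau:
            return k
    return len(eos_probs)


# Toy per-position EoS probabilities for a 6-position canvas.
toy_probs = [0.0, 0.0, 0.5, 0.5, 0.9, 0.9]
for tau in (0.5, 0.9, 0.99):
    print(tau, crop_length(toy_probs, tau))
```

Under this assumed form, a larger $\tau$ can only yield an equal or longer cropped canvas, which matches the trend in the results table, where the average processed length grows as $\tau$ goes from 0.5 to 0.99.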

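Experimental Results
Research Questions
- RQ1: Does a DLM trained with EoS padding expose an internal, prompt-conditioned output-length signal?
- RQ2: Can zero-shot canvas cropping based on the initial EoS logits reduce inference compute without harming, and possibly even improving, task performance?
- RQ3: How does SmartCrop perform across diverse tasks (reasoning, code, instruction following, question answering) with different output-length regimes?
- RQ4: How robust is performance to perturbations of the cropped canvas length?
Key Findings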
FC is the fixed-canvas baseline and SC-$\tau$ is SmartCrop with threshold $\tau$. Asterisks mark statistical significance levels.

| Benchmark | Method | L_p | Avg. Processed Length | Metric ↑ | FLOPs Saved % ↑ | Perf. Δ % ↑ |
|---|---|---|---|---|---|---|
| IfEval | FC | 87.2 | 1367.2 | 0.4801 | - | - |
| IfEval | SC-0.5 | 87.2 | 192.1 | 0.5342 | 98.47*** | +11.25* |
| IfEval | SC-0.75 | 87.2 | 208.0 | 0.5521 | 98.05*** | +14.99** |
| IfEval | SC-0.9 | 87.2 | 222.0 | 0.5459 | 97.64*** | +13.70** |
| IfEval | SC-0.95 | 87.2 | 230.5 | 0.5450 | 97.37*** | +13.50** |
| IfEval | SC-0.99 | 87.2 | 243.8 | 0.5694 | 96.92*** | +18.58*** |
| GSM8K | FC | 140.7 | 396.7 | 0.5616 | - | - |
| GSM8K | SC-0.5 | 140.7 | 239.2 | 0.5452 | 69.39*** | -2.92 |
| GSM8K | SC-0.75 | 140.7 | 261.2 | 0.5516 | 59.09*** | -1.77 |
| GSM8K | SC-0.9 | 140.7 | 278.8 | 0.5457 | 50.15*** | -2.83 |
| GSM8K | SC-0.95 | 140.7 | 288.5 | 0.5490 | 44.93*** | -2.25 |
| GSM8K | SC-0.99 | 140.7 | 302.8 | 0.5520 | 37.01*** | -1.71 |
| HumanEval | FC | 178.5 | 690.5 | 0.4592 | - | - |
| HumanEval | SC-0.5 | 178.5 | 488.2 | 0.4665 | 46.42*** | +1.59 |
| HumanEval | SC-0.75 | 178.5 | 506.7 | 0.4688 | 41.06*** | +2.08 |
| HumanEval | SC-0.9 | 178.5 | 521.9 | 0.4851 | 36.53*** | +5.65 |
| HumanEval | SC-0.95 | 178.5 | 531.0 | 0.4598 | 33.98*** | +0.13 |
| HumanEval | SC-0.99 | 178.5 | 543.6 | 0.4106 | 30.16*** | -10.59 |
| LongFormQA | FC | 77.6 | 589.6 | 0.1341 | - | - |
| LongFormQA | SC-0.5 | 77.6 | 155.1 | 0.2115 | 85.40*** | +57.72*** |
| LongFormQA | SC-0.75 | 77.6 | 164.4 | 0.2152 | 82.56*** | +60.48*** |
| LongFormQA | SC-0.9 | 77.6 | 172.7 | 0.2173 | 79.94*** | +62.01*** |
| LongFormQA | SC-0.95 | 77.6 | 177.5 | 0.2196 | 78.35*** | +63.73*** |
| LongFormQA | SC-0.99 | 77.6 | 185.2 | 0.2210 | 75.86*** | +64.83*** |

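- Across all tasks and thresholds in the table, SmartCrop saves between 30% and 98% of FLOPs (46–98% at $\tau = 0.5$), averaging about 67% over all settings.
- No setting shows a statistically significant performance drop; IfEval and LongFormQA show significant improvements.
- On GSM8K and HumanEval, cropping yields significant compute savings with no statistically significant change in the task metric.
- On IfEval, the shorter canvas reduces padding-induced degradation and accuracy improves.
- On LongFormQA, cropping raises ROUGE-1, suggesting more concise and information-dense answers.
- Overall, the method maintains or improves performance while sharply reducing the span of canvas that is denoised at each step.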
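
The aggregate figures above can be cross-checked against the table with a short sketch. The numbers are copied from the table. The formula for Perf. Δ % (relative change with respect to the FC metric) is an assumption, since the summary does not define it. Recomputed values can differ from the table in the second decimal, presumably because the table was computed from unrounded metrics.

```python
# FLOPs Saved % for SC-0.5, SC-0.75, SC-0.9, SC-0.95, SC-0.99 (copied from the table).
flops_saved = {
    "IfEval": [98.47, 98.05, 97.64, 97.37, 96.92],
    "GSM8K": [69.39, 59.09, 50.15, 44.93, 37.01],
    "HumanEval": [46.42, 41.06, 36.53, 33.98, 30.16],
    "LongFormQA": [85.40, 82.56, 79.94, 78.35, 75.86],
}

# (FC metric, SC-0.9 metric) per benchmark, copied from the table.
metrics = {
    "IfEval": (0.4801, 0.5459),
    "GSM8K": (0.5616, 0.5457),
    "HumanEval": (0.4592, 0.4851),
    "LongFormQA": (0.1341, 0.2173),
}


def mean(values):
    return sum(values) / len(values)


def relative_change_pct(baseline, value):
    """ASSUMED definition of Perf. Δ %: relative change versus the baseline."""
    return (value - baseline) / baseline * 100.0


all_savings = [v for row in flops_saved.values() for v in row]
overall_mean = mean(all_savings)

print(f"range: {min(all_savings):.2f} to {max(all_savings):.2f}")
print(f"mean over all settings: {overall_mean:.2f}")
for name, (fc, sc) in metrics.items():
    print(f"{name}: Perf. delta at tau=0.9 = {relative_change_pct(fc, sc):+.2f}%")
```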
![Figure 2 : Sensitivity of IfEval Performance to Context Length Perturbations. We analyze the robustness of SmartCrop ( $\tau=0.9$ ) by shifting the predicted length $\hat{L}$ by a deviation factor $\delta\in[-50\%,+50\%]$ . The blue curve shows the model performance (mean $\pm$ 95% CI) across these](https://ar5iv.labs.arxiv.org/html/2603.06123/assets/quality_length_sweep.png)
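
The perturbation sweep described in the caption can be sketched as follows. The caption only states that the predicted length $\hat{L}$ is shifted by a deviation factor $\delta \in [-50\%, +50\%]$. The multiplicative form $\hat{L}(1+\delta)$, the rounding, and the clamping to the maximum canvas are assumptions made for illustration, and `perturbed_lengths` is a hypothetical helper name.

```python
def perturbed_lengths(l_hat, deltas, max_len):
    """Shift a predicted length by each deviation factor.

    ASSUMPTIONS (not from the paper): the shift is multiplicative,
    the result is rounded to an integer, and it is clamped to
    [1, max_len] so the canvas never exceeds the original maximum.
    """
    lengths = []
    for delta in deltas:
        shifted = round(l_hat * (1.0 + delta))
        lengths.append(max(1, min(max_len, shifted)))
    return lengths


# Toy sweep: a predicted length of 100 tokens on a 128-token canvas.
sweep = perturbed_lengths(100, [-0.5, -0.25, 0.0, 0.25, 0.5], max_len=128)
print(sweep)
```

Each perturbed length would then be used as the canvas size for a full generation run, so that task performance can be plotted against $\delta$.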
This explanation was generated by AI and reviewed by a human editor.