QUICK REVIEW

[Paper Review] Scaling Beyond Masked Diffusion Language Models

Subham Sekhar Sahoo, Jean-Marie Lemercier|arXiv (Cornell University)|Feb 16, 2026

Topic Modeling | Cited by 0

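One-sentence summary

This paper presents a scaling-law study across three families of discrete diffusion language models (Masked Diffusion, Uniform-State Diffusion, Interpolating Diffusion), shows that perplexity is not directly comparable across families, and emphasizes the sampling speed-quality trade-off, including 1.7B-parameter results in which uniform-state diffusion outperforms autoregressive and Masked diffusion models on GSM8K.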

ABSTRACT

Diffusion language models are a promising alternative to autoregressive models due to their potential for faster generation. Among discrete diffusion approaches, Masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks. In this work, we present the first scaling law study of uniform-state and interpolating discrete diffusion methods. We also show that Masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective. We find that perplexity is informative within a diffusion family but can be misleading across families, where models with worse likelihood scaling may be preferable due to faster and more practical sampling, as reflected by the speed-quality Pareto frontier. These results challenge the view that Masked diffusion is categorically the future of diffusion language modeling and that perplexity alone suffices for cross-algorithm comparison. Scaling all methods to 1.7B parameters, we show that uniform-state diffusion remains competitive on likelihood-based benchmarks and outperforms autoregressive and Masked diffusion models on GSM8K, despite worse validation perplexity. We provide the code, model checkpoints, and video tutorials on the project page: http://s-sahoo.github.io/scaling-dllms

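Motivation and objectives

Explain why diffusion models are a viable alternative to autoregressive models for faster generation on language tasks.
Systematically compare three diffusion families (Masked, Uniform-State, Interpolating) using a compute-matched scaling analysis.
Quantify how training objectives and sampling methods affect each family's efficiency and throughput.
Evaluate scalability at 1.7B parameters on likelihood-based benchmarks and on reasoning datasets such as GSM8K.
Challenge the notion that perplexity alone establishes Masked diffusion as inherently superior to other methods.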

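Proposed method

Study three state-of-the-art diffusion model families: Masked Diffusion (MDLM), Uniform-State Diffusion (Duo), and Interpolating Diffusion (Eso-LM).
Run a compute-matched scaling analysis to fit, for each family, a scaling law relating validation loss to model size.
Evaluate the speed-quality trade-off by measuring throughput and sample quality at different numbers of sampling steps, and construct the resulting Pareto frontier.
Evaluate performance at 1.7B parameters on likelihood-based benchmarks and on a math/reasoning dataset (GSM8K).
Study training-objective variants (such as the low-variance MDLM loss) and their effect on compute efficiency.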
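
As an illustration of the scaling-law fitting step listed above, the following is a minimal sketch that fits a pure power law, loss ~= a * size^(-b), by least squares in log-log space. The functional form, the function name `fit_power_law`, and the data points are assumptions made for this example; they are synthetic and are not taken from the paper, whose actual fitting procedure may differ (for instance by including an irreducible-loss term).

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss ~= a * size**(-b) via ordinary least squares on log-log data.

    Returns the tuple (a, b). Assumes all sizes and losses are positive.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    # Slope and intercept of the best-fit line in log-log space.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic (made-up) points generated from a known power law,
# used only to check that the fit recovers its parameters.
sizes = [1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.05 for n in sizes]
a, b = fit_power_law(sizes, losses)
print(f"a={a:.3f}, b={b:.3f}")
```

Fitting one such curve per family under the same compute budget is what makes the exponents and constants comparable; with real measurements the points will not lie exactly on a line, so the residuals are worth inspecting before reading much into the exponent.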

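Experimental results

Research questions

RQ1: Once cross-family scaling and practical sampling efficiency are taken into account, is Masked Diffusion still the dominant diffusion paradigm?
RQ2: How do Uniform-State and Interpolating Diffusion compare with Masked diffusion in perplexity, sampling speed, and downstream task performance?
RQ3: Does a low-variance training objective improve the compute efficiency of MDLM and shift the compute-optimal point toward smaller models?
RQ4: Under compute-matched conditions, what are the relative scaling exponents and constants of MDLM, Duo, and Eso-LM?
RQ5: Which diffusion family achieves the best speed-quality Pareto frontier across compute budgets and tasks?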

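Key findings

Perplexity is informative within a diffusion family but can be misleading across families, where a model with worse likelihood scaling may be preferable because its sampling is faster and more practical.
Uniform-State diffusion remains competitive on likelihood-based benchmarks and, after large-scale supervised fine-tuning on GSM8K, can outperform both AR and MDLM.
A low-variance training objective reduces MDLM's training variance and shifts the compute-optimal point toward smaller models, making MDLM training approximately 12% more FLOPs-efficient.
At 1.7B parameters, Duo leads on throughput in several compute regimes and shows strong math/reasoning performance after fine-tuning, despite its worse validation perplexity.
Read together with the speed-quality frontier, these results indicate that fast sampling and guidance capability can make a diffusion family with worse cross-family perplexity competitive, or even superior, in practice.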
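
To make the notion of a speed-quality Pareto frontier concrete, here is a minimal sketch that keeps only the configurations not dominated by any other. The function name `pareto_frontier`, the configuration labels, and all numbers are hypothetical and chosen purely for illustration; they are not results from the paper. The sketch assumes throughput is better when higher and the quality metric is a cost that is better when lower.

```python
def pareto_frontier(points):
    """Return the non-dominated points, sorted by increasing throughput.

    Each point is (label, throughput, quality_cost). A point is dominated
    if another point is at least as fast and at least as good, and strictly
    better in one of the two.
    """
    frontier = []
    for label, speed, cost in points:
        dominated = any(
            s >= speed and c <= cost and (s > speed or c < cost)
            for _, s, c in points
        )
        if not dominated:
            frontier.append((label, speed, cost))
    return sorted(frontier, key=lambda p: p[1])


# Hypothetical configurations: two made-up models at two step counts each.
points = [
    ("model_x, 16 steps", 900.0, 60.0),
    ("model_x, 64 steps", 300.0, 40.0),
    ("model_y, 16 steps", 800.0, 75.0),  # slower and worse than model_x at 16 steps
    ("model_y, 64 steps", 250.0, 35.0),
]
frontier = pareto_frontier(points)
for label, speed, cost in frontier:
    print(label, speed, cost)
```

In this toy data the second model has the best quality at its slowest setting yet is dominated at its fastest one, which is the kind of crossover a single perplexity number cannot show.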

This review was generated by AI and reviewed by human editors.