QUICK REVIEW

[Paper Digest] Auto-Regressive Masked Diffusion Models

Mahdi Karami, Ali Ghodsi|arXiv (Cornell University)|Jan 23, 2026

Topic Modeling | Cited by 0

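One-Sentence Summary

ARMD reframes masked diffusion as a block-wise causal model and introduces a strictly causal, permutation-equivariant architecture with strided parallel generation, which enables autoregressive-style decoding.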

ABSTRACT

Masked diffusion models (MDMs) have emerged as a promising approach for language modeling, yet they face a performance gap compared to autoregressive models (ARMs) and require more training iterations. In this work, we present the Auto-Regressive Masked Diffusion (ARMD) model, an architecture designed to close this gap by unifying the training efficiency of autoregressive models with the parallel generation capabilities of diffusion-based models. Our key insight is to reframe the masked diffusion process as a block-wise causal model. This perspective allows us to design a strictly causal, permutation-equivariant architecture that computes all conditional probabilities across multiple denoising steps in a single, parallel forward pass. The resulting architecture supports efficient, autoregressive-style decoding and a progressive permutation training scheme, allowing the model to learn both canonical left-to-right and random token orderings. Leveraging this flexibility, we introduce a novel strided parallel generation strategy that accelerates inference by generating tokens in parallel streams while maintaining global coherence. Empirical results demonstrate that ARMD achieves state-of-the-art performance on standard language modeling benchmarks, outperforming established diffusion baselines while requiring significantly fewer training steps. Furthermore, it establishes a new benchmark for parallel text generation, effectively bridging the performance gap between parallel and sequential decoding.
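
One way to read the "block-wise causal" view in the abstract is as a factorization over groups of tokens. The sketch below uses assumed notation that does not come from the paper: the positions are partitioned into K blocks B_1, ..., B_K, where block B_k holds the tokens revealed at denoising step k, and tokens inside a block are predicted independently given all earlier blocks.

```latex
p_\theta(x) \;=\; \prod_{k=1}^{K} \; \prod_{i \in B_k}
  p_\theta\!\left(x_i \,\middle|\, x_{B_1 \cup \dots \cup B_{k-1}}\right)
```

Under this notation, choosing singleton blocks B_k = {k} recovers the usual left-to-right autoregressive factorization, while larger blocks correspond to unmasking several tokens per step.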

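Motivation and Objectives

Close the performance gap between masked diffusion models (MDMs) and autoregressive models (ARMs) in language modeling.
Propose a strictly causal, permutation-equivariant architecture that evaluates all conditionals in parallel in a single forward pass.
Enable autoregressive-style decoding with key-value (KV) caching, together with a strided parallel generation strategy.
Provide a training scheme that supports learning from both the standard left-to-right order and random token orders.
Demonstrate state-of-the-art performance with fewer training steps than diffusion baselines.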
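
To make "causal" and "strictly causal" in the objectives above concrete, here is a minimal NumPy sketch of the corresponding attention masks, plus a block-wise variant. The function names and the `block_ids` argument are illustrative assumptions and do not come from the paper. The sketch shows mask patterns only, not ARMD's actual layer wiring.

```python
import numpy as np


def causal_mask(n: int) -> np.ndarray:
    """mask[i, j] is True when query i may attend to key j, with j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))


def strictly_causal_mask(n: int) -> np.ndarray:
    """Like causal_mask, but the diagonal is excluded, so j < i."""
    return np.tril(np.ones((n, n), dtype=bool), k=-1)


def blockwise_strict_mask(block_ids: np.ndarray) -> np.ndarray:
    """Query i may attend to key j only if j belongs to an earlier block.

    block_ids[i] is the (assumed) denoising step at which token i is revealed.
    """
    return block_ids[None, :] < block_ids[:, None]


if __name__ == "__main__":
    # Two blocks of two tokens each: tokens 2 and 3 see tokens 0 and 1 only.
    print(blockwise_strict_mask(np.array([0, 0, 1, 1])).astype(int))
```

With one token per block, the block-wise mask coincides with the strictly causal mask. The first row of a strictly causal mask is empty, so a real model would need something for that position to attend to, for example a start token.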

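Proposed Method

Reframe masked diffusion as a block-wise causal model so that all conditionals can be evaluated in parallel in a single forward pass.
Introduce a causal, permutation-equivariant, attention-based architecture with strictly causal layers and a two-stream attention mechanism (causal and strictly causal).
Use progressive permutation training to mix learning from left-to-right and random token orders.
Add KV caching at inference time to support efficient autoregressive-style decoding.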
Develop a strided parallel generation (SBP) strategy that generates tokens in parallel streams while maintaining global coherence.
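
The digest does not spell out how the parallel streams are laid out. As one plausible reading, the sketch below cuts the sequence into contiguous segments and emits the next token of every segment at each step. The function name `strided_schedule` and this segment layout are assumptions made for illustration, not the paper's algorithm.

```python
from typing import List


def strided_schedule(seq_len: int, num_streams: int) -> List[List[int]]:
    """Return, for each decoding step, the positions generated in parallel.

    Assumed layout: the sequence is split into `num_streams` contiguous
    segments, and step t emits position t of every segment.
    """
    if seq_len < 1 or num_streams < 1:
        raise ValueError("seq_len and num_streams must be positive")
    seg_len = -(-seq_len // num_streams)  # ceiling division
    steps = []
    for t in range(seg_len):
        positions = [s * seg_len + t for s in range(num_streams)]
        # The last segment may be shorter, so drop out-of-range positions.
        steps.append([p for p in positions if p < seq_len])
    return steps


if __name__ == "__main__":
    for step, positions in enumerate(strided_schedule(8, 2)):
        print(step, positions)
```

Under this layout, a sequence of 8 tokens with 2 streams takes 4 steps instead of 8, and every position is generated exactly once.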

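Experimental Results

Research Questions

RQ1: Can masked diffusion models be reformulated as block-wise causal models so that all conditionals can be evaluated in parallel?
RQ2: Does a strictly causal, permutation-equivariant architecture improve training efficiency and language modeling performance relative to existing MDMs?
RQ3: Can strided parallel generation narrow the gap between diffusion-style and autoregressive decoding in both speed and quality?
RQ4: How does progressive permutation training affect learning from standard and random token orders?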
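
For the last question, it may help to picture what a progressive permutation schedule could look like. The digest gives no details, so the sketch below is hypothetical: training starts with pure left-to-right order and the probability of sampling a random order ramps up linearly. The names `permutation_prob` and `sample_order` and the parameters `warmup_steps` and `max_prob` are all assumptions for illustration.

```python
import numpy as np


def permutation_prob(step: int, warmup_steps: int, max_prob: float = 0.5) -> float:
    """Linearly ramp the chance of using a random order from 0 to max_prob."""
    frac = min(max(step / warmup_steps, 0.0), 1.0)
    return max_prob * frac


def sample_order(seq_len: int, step: int, warmup_steps: int,
                 rng: np.random.Generator, max_prob: float = 0.5) -> np.ndarray:
    """Pick the token order for one training example (hypothetical schedule)."""
    if rng.random() < permutation_prob(step, warmup_steps, max_prob):
        return rng.permutation(seq_len)  # random token order
    return np.arange(seq_len)            # canonical left-to-right order


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for step in (0, 500, 1000):
        print(step, sample_order(6, step, warmup_steps=1000, rng=rng))
```

At step 0 this always returns the left-to-right order. Later steps mix in random permutations, which is one simple way to expose a model to both kinds of ordering.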

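Key Findings

ARMD achieves state-of-the-art performance on standard language modeling benchmarks.
ARMD outperforms established diffusion baselines while requiring significantly fewer training steps.
The model sets a new benchmark for parallel text generation by bridging the performance gap between parallel and sequential decoding.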

This digest was generated by AI and reviewed by a human editor.