QUICK REVIEW

[Paper Review] Auto-Regressive Masked Diffusion Models

Mahdi Karami, Ali Ghodsi | arXiv (Cornell University) | 2026-01-23

Topic Modeling | Citations: 0

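One-line summary

ARMD recasts masked diffusion as a block-wise causal model and introduces a strictly causal, permutation-equivariant architecture, unifying the training efficiency of autoregressive models with the parallel generation of diffusion models through strided parallel generation.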

ABSTRACT

Masked diffusion models (MDMs) have emerged as a promising approach for language modeling, yet they face a performance gap compared to autoregressive models (ARMs) and require more training iterations. In this work, we present the Auto-Regressive Masked Diffusion (ARMD) model, an architecture designed to close this gap by unifying the training efficiency of autoregressive models with the parallel generation capabilities of diffusion-based models. Our key insight is to reframe the masked diffusion process as a block-wise causal model. This perspective allows us to design a strictly causal, permutation-equivariant architecture that computes all conditional probabilities across multiple denoising steps in a single, parallel forward pass. The resulting architecture supports efficient, autoregressive-style decoding and a progressive permutation training scheme, allowing the model to learn both canonical left-to-right and random token orderings. Leveraging this flexibility, we introduce a novel strided parallel generation strategy that accelerates inference by generating tokens in parallel streams while maintaining global coherence. Empirical results demonstrate that ARMD achieves state-of-the-art performance on standard language modeling benchmarks, outperforming established diffusion baselines while requiring significantly fewer training steps. Furthermore, it establishes a new benchmark for parallel text generation, effectively bridging the performance gap between parallel and sequential decoding.

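Motivation and objectives

Address the performance gap between masked diffusion models (MDMs) and autoregressive models (ARMs) in language modeling, along with the larger number of training iterations MDMs require.
Propose a strictly causal, permutation-equivariant architecture that evaluates all conditional probabilities in parallel in a single forward pass.
Enable autoregressive-style decoding with key-value (KV) caching and a strided parallel generation strategy.
Provide a training scheme that covers both the canonical left-to-right order and random token orderings.
Achieve state-of-the-art performance on standard language modeling benchmarks while requiring fewer training steps than diffusion baselines.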
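
To make "all conditional probabilities in a single forward pass" concrete, the block-wise causal view can be written as a factorization over ordered blocks of positions. This is a sketch in assumed notation (the symbols below are not taken from the review): tokens in the same block are predicted in parallel given only earlier blocks, which mirrors one denoising step.

```latex
% Assumed notation (not from the review):
%   x = (x_1, ..., x_L) is the token sequence,
%   B_1, ..., B_K is an ordered partition of the positions {1, ..., L},
%   B_{<k} = B_1 \cup ... \cup B_{k-1} is everything revealed before step k.
p_\theta(x) = \prod_{k=1}^{K} p_\theta\!\left(x_{B_k} \mid x_{B_{<k}}\right),
\qquad
p_\theta\!\left(x_{B_k} \mid x_{B_{<k}}\right)
  = \prod_{i \in B_k} p_\theta\!\left(x_i \mid x_{B_{<k}}\right).
```

With K = L and B_k = {k} this reduces to the usual left-to-right autoregressive factorization; larger blocks trade sequential steps for parallel predictions within each block.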

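Proposed method

Reframes masked diffusion as a block-wise causal model so that all conditional probabilities can be evaluated in parallel in one forward pass.
Introduces a causal, permutation-equivariant attention-based architecture built from strictly causal layers and a two-stream attention mechanism (one causal stream, one strictly causal stream).
Enables hybrid training through progressive permutation training, so the model learns both the canonical left-to-right order and random token orderings.
Adopts KV caching to support efficient autoregressive-style decoding at inference time.
Develops a strided parallel generation (SBP) strategy that generates tokens in parallel streams while maintaining global coherence.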
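
The review does not give the masks or the decoding schedule, so the following is only a minimal NumPy sketch of one plausible reading. It assumes that "causal" means a position may attend to its own block and earlier blocks, that "strictly causal" means earlier blocks only, and that a strided schedule decodes one token from each of several evenly spaced streams per step. The function names `strided_order`, `order_to_block_ids`, and `block_causal_masks` are hypothetical and not from the paper.

```python
import numpy as np


def strided_order(seq_len, num_streams):
    """Hypothetical strided schedule (an assumption, not the paper's spec).

    The sequence is cut into `num_streams` contiguous streams; decoding step t
    emits the t-th token of every stream, i.e. positions t, t + stride, ...
    """
    stride = -(-seq_len // num_streams)  # ceil division: tokens per stream
    return [
        [t + s * stride for s in range(num_streams) if t + s * stride < seq_len]
        for t in range(stride)
    ]


def order_to_block_ids(order, seq_len):
    """Map each position to the decoding step (block index) that emits it."""
    ids = np.empty(seq_len, dtype=int)
    for step, positions in enumerate(order):
        ids[positions] = step
    return ids


def block_causal_masks(block_ids):
    """Return (causal, strict) boolean masks of shape (L, L).

    mask[q, k] is True when query position q may attend to key position k.
    causal: keys in the same block or an earlier one.
    strict: keys in earlier blocks only, so tokens decoded together
            never see each other.
    """
    b = np.asarray(block_ids)
    causal = b[None, :] <= b[:, None]
    strict = b[None, :] < b[:, None]
    return causal, strict


if __name__ == "__main__":
    order = strided_order(seq_len=8, num_streams=2)
    block_ids = order_to_block_ids(order, seq_len=8)
    causal, strict = block_causal_masks(block_ids)
    print("decoding order:", order)
    print("block ids:", block_ids.tolist())
    print("strict mask:\n", strict.astype(int))
```

Because the masks depend only on block indices, the same code covers a left-to-right order (block ids 0, 1, 2, ...) and any permuted order. Note that rows of the strict mask belonging to the first block are all False, so a real attention layer would need something like a start token for those positions to attend to.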

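Experimental results

Research questions

RQ1: Can a masked diffusion model be reframed as a block-wise causal model so that all conditional probabilities are evaluated in parallel?
RQ2: Does a strictly causal, permutation-equivariant architecture improve training efficiency and language modeling performance over existing MDMs?
RQ3: Can strided parallel generation narrow the speed and quality gap between diffusion-based decoding and autoregressive decoding?
RQ4: How does progressive permutation training affect learning of both the canonical order and random token orderings?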

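Key findings

ARMD achieves state-of-the-art performance on standard language modeling benchmarks.
ARMD outperforms diffusion baselines while requiring significantly fewer training steps.
The model sets a new benchmark for parallel text generation and narrows the performance gap between parallel and sequential decoding.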


This review was generated by AI and checked by a human editor.