QUICK REVIEW

[Paper Review] Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

Vishnu Teja Kunde, Fatemeh Doudi|arXiv (Cornell University)|2026. 03. 13.

Topic Modeling | Citations: 0

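One-Line Summary

This paper formalizes diffusion language model (DLM) generation as a finite-horizon MDP, derives an exact, unbiased policy gradient expressed in terms of stepwise advantages, and introduces entropy-guided step selection and stepwise advantage estimation to make RL for DLMs scalable, achieving state-of-the-art results on coding and logical reasoning benchmarks.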

ABSTRACT

Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute-efficient estimator, we (i) select denoising steps for policy updates via an entropy-guided approximation bound, and (ii) estimate intermediate advantages using a one-step denoising reward naturally provided by the diffusion model, avoiding costly multi-step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state-of-the-art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post-training approaches for DLMs. Code is available at https://github.com/vishnutez/egspo-dllm-rl.

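Motivation and Objectives

Formalize diffusion-based sequence generation as a finite-horizon MDP over the denoising steps.
Derive an exact, unbiased policy gradient that decomposes over denoising steps.
Propose a practical, compute-efficient estimator that exploits the diffusion structure (entropy-guided step selection).
Introduce stepwise advantage estimation (A_t) via one-step denoising to avoid costly multi-step rollouts.
Demonstrate state-of-the-art results on coding and logical reasoning benchmarks relative to prior RL approaches.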
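
The step-wise decomposition named in the second objective follows the standard finite-horizon policy gradient identity. The block below is a sketch under stated assumptions: the trajectory is indexed in diffusion time (x_T down to x_0), the reward r(x_0, q) is given only at the final step, and b_t is an arbitrary state-dependent baseline. The indexing and symbols are illustrative and may differ from the paper's own notation.

```latex
\begin{align}
\pi_\theta(x_{0:T-1} \mid x_T, q)
  &= \prod_{t=1}^{T} \pi_\theta(x_{t-1} \mid x_t, q), \\
\nabla_\theta J(\theta)
  &= \mathbb{E}\!\left[ \sum_{t=1}^{T}
     \nabla_\theta \log \pi_\theta(x_{t-1} \mid x_t, q)\,
     \bigl( r(x_0, q) - b_t(x_t, q) \bigr) \right].
\end{align}
```

Because the score function has zero conditional mean, E[∇_θ log π_θ(x_{t-1} | x_t, q) | x_t] = 0, subtracting any baseline that depends only on the current state leaves the gradient unbiased. Choosing a value function as the baseline turns each term into a stepwise advantage, and only per-step transition log-probabilities appear, so the marginal sequence likelihood never has to be evaluated.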

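Proposed Method

Model the MDLM denoising process as a T-step MDP with state s_t = (x_{T-t}, q) and action a_t = x_{T-t-1}.
Policy gradient: ∇_θ J(θ) = E[r(x_0, q) ∇_θ log π_θ(x_0 | q)], which is decomposed over denoising steps using stepwise advantages A_t.
Entropy-guided step selection: compute gradients only at the top-K steps with the highest entropy H(π_θ^{t|t+1}) (a greedy selection by entropy).
Define A_t = r(x_0, q) − V_{t+1}^{π}(x_{t+1}, q) and approximate V_t with a one-step denoising estimate V̂_t.
Estimate advantages through the one-step denoising distribution π_θ^{0|t}, avoiding multi-step rollouts.
Using the selected step set S, construct a GRPO-based loss L(θ; θ_old) with a per-step clipped surrogate term and KL regularization.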
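
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, array layouts, the choice to sum token entropies over still-masked positions, and the clip range eps=0.2 are all assumptions made for the example. The KL regularizer and GRPO's group-level reward normalization are omitted for brevity.

```python
import numpy as np


def step_entropies(probs, masked):
    """Per-step entropy score (hypothetical helper).

    probs:  (T, L, V) token distributions predicted at each denoising step.
    masked: (T, L) boolean, True where a position is still masked at that step.
    Assumption: the step score is the sum of token entropies over masked positions.
    """
    token_ent = -(probs * np.log(probs + 1e-12)).sum(axis=-1)  # shape (T, L)
    return (token_ent * masked).sum(axis=-1)                    # shape (T,)


def select_top_k_steps(entropies, k):
    """Greedy selection: indices of the k highest-entropy steps, in time order."""
    top = np.argsort(-entropies, kind="stable")[:k]
    return np.sort(top)


def stepwise_advantages(final_reward, next_value_estimates):
    """A_t = r(x_0, q) - V_hat_{t+1}; next_value_estimates[t] holds V_hat_{t+1}."""
    return final_reward - np.asarray(next_value_estimates, dtype=float)


def clipped_surrogate_loss(logp_new, logp_old, advantages, steps, eps=0.2):
    """PPO/GRPO-style clipped surrogate, averaged over the selected steps only."""
    ratio = np.exp(logp_new[steps] - logp_old[steps])
    unclipped = ratio * advantages[steps]
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages[steps]
    return -np.mean(np.minimum(unclipped, clipped))


# Toy trajectory: T=3 steps, L=2 positions, vocabulary of size 2.
probs = np.array([
    [[0.5, 0.5], [0.5, 0.5]],  # step 0: very uncertain
    [[0.9, 0.1], [0.9, 0.1]],  # step 1: fairly confident
    [[0.5, 0.5], [1.0, 0.0]],  # step 2: one uncertain position
])
masked = np.ones((3, 2), dtype=bool)

entropies = step_entropies(probs, masked)
selected = select_top_k_steps(entropies, k=2)
advantages = stepwise_advantages(1.0, [0.25, 0.5, 0.75])
logp_old = np.array([-1.0, -1.0, -1.0])
loss = clipped_surrogate_loss(logp_old.copy(), logp_old, advantages, selected)
```

In the toy trajectory, the step where the model is already confident is skipped, so gradient computation is spent only on the two high-entropy steps. In a real setup the value estimates would come from scoring one-step denoised completions rather than being supplied by hand.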

Figure 1: Overview of the performance on coding and reasoning tasks. Our approach outperforms the existing baselines in coding and logical reasoning tasks, while maintaining competitive performance in mathematical reasoning tasks.

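Experimental Results

Research Questions

RQ1: What is the correct MDP formulation for diffusion-based sequence generation?
RQ2: Can an exact, unbiased policy gradient that decomposes over denoising steps be derived?
RQ3: How does the temporal structure of diffusion enable stepwise credit assignment and allocation of compute?
RQ4: Do entropy-guided step selection and stepwise advantage estimation improve the efficiency and performance of RL fine-tuning for DLMs?
RQ5: How do the proposed methods compare with existing RL approaches for diffusion LMs on coding and reasoning benchmarks?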

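Key Results

EGSPO and EGSPO-SA improve over the base LLaDA-8B-Instruct model across reasoning tasks.
EGSPO-SA achieves the strongest overall performance on logical reasoning benchmarks such as Sudoku and Countdown.
On coding benchmarks (MBPP, HumanEval), both methods outperform the baselines across generation lengths, with EGSPO-SA the strongest overall.
On mathematical reasoning tasks (GSM8K, MATH500), gains are more modest and in line with prior diffusion RL methods.
EGSPO-SA is more compute-efficient than prior approaches (fewer FLOPs, samples, and gradient steps).
Ablation studies show that entropy-guided step selection outperforms uniform step selection, underscoring the importance of stepwise credit assignment.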


This review was generated by AI and reviewed by a human editor.