QUICK REVIEW

[Paper Review] Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

Vishnu Teja Kunde, Fatemeh Doudi|arXiv (Cornell University)|Mar 13, 2026

Topic Modeling | Cited by 0

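One-sentence summary

The paper models generation with diffusion language models (DLMs) as a finite-horizon MDP, derives an exact, unbiased policy gradient expressed through stepwise advantages, and introduces entropy-guided step selection together with stepwise advantage estimation to make reinforcement learning for DLMs scalable, reaching state-of-the-art results on coding and logical reasoning benchmarks.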

ABSTRACT

Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute-efficient estimator, we (i) select denoising steps for policy updates via an entropy-guided approximation bound, and (ii) estimate intermediate advantages using a one-step denoising reward naturally provided by the diffusion model, avoiding costly multi-step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state-of-the-art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post-training approaches for DLMs. Code is available at https://github.com/vishnutez/egspo-dllm-rl.

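Motivation and objectives

Formulate diffusion-based sequence generation as a finite-horizon MDP over the denoising steps.
Derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of stepwise advantages.
Propose a practical, compute-efficient estimator that exploits the diffusion structure (entropy-guided step selection).
Introduce stepwise advantage estimation based on a one-step denoising reward, avoiding costly multi-step rollouts.
Demonstrate state-of-the-art results on coding and logical reasoning benchmarks relative to prior RL methods for DLMs.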
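
The decomposition named in these objectives can be sketched with the standard policy-gradient identity for a finite-horizon MDP whose only reward arrives at the end. The notation below is an assumption made for illustration: x_t is the partially denoised sequence at diffusion step t, q is the prompt, π_θ^{t|t+1} is the one-step denoising policy, and V_{t+1}^{π} is a value baseline that depends only on the state before the step. The paper's exact statement may differ in detail.

```latex
\begin{align}
J(\theta) &= \mathbb{E}_{x_{0:T-1} \sim \pi_\theta(\cdot \mid x_T, q)}\big[\, r(x_0, q) \,\big], \\
\log \pi_\theta(x_{0:T-1} \mid x_T, q) &= \sum_{t=0}^{T-1} \log \pi_\theta^{t \mid t+1}(x_t \mid x_{t+1}, q), \\
\nabla_\theta J(\theta) &= \mathbb{E}\Big[ \sum_{t=0}^{T-1} A_t \, \nabla_\theta \log \pi_\theta^{t \mid t+1}(x_t \mid x_{t+1}, q) \Big], \\
A_t &= r(x_0, q) - V_{t+1}^{\pi}(x_{t+1}, q).
\end{align}
```

Subtracting the baseline V_{t+1}^{π}(x_{t+1}, q) keeps the estimator unbiased because the expected score ∇_θ log π_θ^{t|t+1} is zero once x_{t+1} is fixed. No sequence-level likelihood of x_0 appears anywhere, only per-step denoising log-probabilities.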

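Proposed method

Model the MDLM denoising process as a T-step MDP with state s_t = (x_{T-t}, q) and action a_t = x_{T-t-1}.
Derive the policy gradient ∇_θ J(θ) = E[r(x_0,q) ∇_θ log π_θ(x|q)] and decompose it over denoising steps with stepwise advantages A_t.
Propose entropy-guided step selection: using entropy H(π_θ^{t|t+1}) as a greedy criterion, pick the top-K denoising steps with the highest entropy and compute gradients only at those steps.
Define the stepwise advantage as A_t = r(x_0,q) − V_{t+1}^{π}(x_{t+1},q), and approximate the value V_t with a one-step denoising estimate \hat{V}_t.
Estimate the advantages through the one-step denoising distribution π_θ^{0|t}, avoiding multi-step rollouts.
Construct a GRPO-based loss L(θ; θ_old) that applies a per-step clipped surrogate term over the selected steps S, together with KL regularization.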
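
The pieces listed above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the toy numbers, and the callables `denoise_to_x0` and `reward_fn` are hypothetical, the per-step entropies are assumed to be already computed, and the KL regularization term is omitted.

```python
import numpy as np

def select_steps_by_entropy(step_entropies, k):
    """Indices of the k denoising steps with the highest entropy, in ascending order."""
    top = np.argsort(np.asarray(step_entropies))[::-1][:k]
    return np.sort(top)

def one_step_value(state, denoise_to_x0, reward_fn, n_samples=1):
    """Estimate V(state) as the mean reward of one-step completions x0 ~ pi^{0|t}.

    `denoise_to_x0` and `reward_fn` are hypothetical callables standing in for
    the model's one-step denoiser and the task reward.
    """
    rewards = [reward_fn(denoise_to_x0(state)) for _ in range(n_samples)]
    return float(np.mean(rewards))

def stepwise_advantages(final_reward, value_estimates):
    """A_t = r(x_0, q) - V_{t+1}; value_estimates[t] holds the estimate of V_{t+1}."""
    return final_reward - np.asarray(value_estimates, dtype=float)

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO/GRPO-style clipped objective, averaged over the selected steps."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))

# Toy trajectory with five denoising steps (all numbers are made up).
step_entropies = np.array([0.1, 2.3, 0.7, 1.9, 0.05])
value_estimates = np.array([0.2, 0.4, 0.5, 0.6, 0.9])

selected = select_steps_by_entropy(step_entropies, k=2)
advantages = stepwise_advantages(1.0, value_estimates)[selected]
objective = clipped_surrogate(np.zeros(2), np.zeros(2), advantages)
```

Restricting the surrogate to the K selected steps means only K forward and backward passes per trajectory rather than T. In this toy run the two highest-entropy steps are 1 and 3, and with identical old and new log-probabilities the objective reduces to the mean of their advantages.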

Figure 1 : Overview of the performance on coding and reasoning tasks. Our approach outperforms the existing baselines in coding and logical reasoning tasks, while maintaining competitive performance in mathematical reasoning tasks.

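Experimental results

Research questions

RQ1: What is the right MDP formulation for diffusion-based sequence generation?
RQ2: Can an exact, unbiased policy gradient that decomposes over denoising steps be derived for diffusion language models?
RQ3: How does the temporal structure of diffusion make stepwise credit assignment and compute allocation tractable?
RQ4: Do entropy-guided step selection and stepwise advantage estimation improve the efficiency and performance of RL fine-tuning for DLMs?
RQ5: How does the proposed method perform on coding and reasoning benchmarks compared with existing RL methods for diffusion language models?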

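Key findings

EGSPO and EGSPO-SA outperform the base LLaDA-8B-Instruct model on reasoning tasks.
EGSPO-SA achieves the strongest overall performance on logical reasoning benchmarks such as Sudoku and Countdown.
On coding benchmarks (MBPP, HumanEval), both methods beat the baselines at every generation length, with EGSPO-SA strongest overall.
On mathematical reasoning tasks (GSM8K, MATH500), gains are limited and on par with prior diffusion RL methods.
EGSPO-SA is more compute-efficient than prior methods, requiring fewer FLOPs, samples, and gradient steps.
Ablations show that entropy-guided step selection outperforms uniform step selection and highlight the importance of stepwise credit assignment.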

This review was generated by AI and checked by a human editor.