QUICK REVIEW

[Paper Review] Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

Vishnu Teja Kunde, Fatemeh Doudi|arXiv (Cornell University)|Mar 13, 2026

Topic Modeling | Citations: 0

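One-sentence summary

The paper formulates generation with diffusion language models (DLMs) as a finite-horizon MDP, derives an exact, unbiased policy gradient expressed in terms of stepwise advantages, and makes RL for DLMs scalable through entropy-guided step selection and one-step estimates of those advantages, achieving state-of-the-art results on coding and logical reasoning benchmarks.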

ABSTRACT

Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute-efficient estimator, we (i) select denoising steps for policy updates via an entropy-guided approximation bound, and (ii) estimate intermediate advantages using a one-step denoising reward naturally provided by the diffusion model, avoiding costly multi-step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state-of-the-art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post-training approaches for DLMs. Code is available at https://github.com/vishnutez/egspo-dllm-rl.

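Motivation and objectives

Formulate diffusion-based sequence generation as a finite-horizon MDP over the denoising steps.
Derive an exact, unbiased policy gradient that decomposes over the denoising steps.
Propose a practical, compute-efficient estimator that exploits the diffusion structure (entropy-guided step selection).
Estimate stepwise advantages with a one-step denoising reward, avoiding costly rollouts.
Demonstrate state-of-the-art results on coding and logical reasoning benchmarks compared with prior RL approaches.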
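
To make the second objective concrete, the block below writes out the textbook policy-gradient identity for a finite-horizon MDP with a terminal reward and a state-value baseline, indexed by denoising step. It is a sketch under assumptions: x_t denotes the partially denoised sequence at diffusion step t, q the prompt, and π_θ^{t|t+1}(x_t | x_{t+1}, q) the one-step denoising policy. This notation is chosen for illustration, and the paper's exact statement may differ.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}\!\left[\sum_{t=0}^{T-1} A_t \,
      \nabla_\theta \log \pi_\theta^{t|t+1}(x_t \mid x_{t+1}, q)\right],
\qquad
A_t = r(x_0, q) - V^{\pi}_{t+1}(x_{t+1}, q).
```

Subtracting the value V^{π}_{t+1}(x_{t+1}, q) leaves the gradient unbiased because, conditioned on x_{t+1}, the score ∇_θ log π_θ^{t|t+1} has zero mean. Only per-step log-probabilities appear, so the sequence-level likelihood never has to be evaluated.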

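Proposed method

Model the MDLM denoising process as a T-step MDP with state s_t = (x_{T-t}, q) and action a_t = x_{T-t-1}.
Derive the policy gradient ∇_θ J(θ) = E[r(x_0,q) ∇_θ log π_θ(x|q)] and decompose it over denoising steps using stepwise advantages A_t.
Entropy-guided step selection: compute gradients only at the K denoising steps with the highest entropy H(π_θ^{t|t+1}) (a greedy, entropy-maximizing selection).
Define the stepwise advantage A_t = r(x_0,q) − V_{t+1}^{π}(x_{t+1},q) and approximate the value V_t with a one-step denoising estimate \hat{V}_t.
Estimate the advantages using the one-step denoising distribution π_θ^{0|t}, avoiding multi-step rollouts.
Using the steps in the selected set S, form a GRPO-based loss L(θ; θ_old) with a per-step clipped surrogate term and KL regularization.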
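
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation. Every function name here is hypothetical, and so are the defaults `eps=0.2` and `beta=0.01`. Defining a step's entropy as the sum of per-position token entropies, and applying the KL penalty per step, are also assumptions made for the sketch.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def step_entropy(step_probs):
    """Entropy of one denoising step, assumed here to be the sum over positions."""
    return sum(token_entropy(p) for p in step_probs)


def select_steps(entropies, k):
    """Indices of the k highest-entropy steps, returned in ascending order."""
    ranked = sorted(range(len(entropies)), key=lambda t: entropies[t], reverse=True)
    return sorted(ranked[:k])


def stepwise_advantages(reward, value_estimates):
    """A_t = r - V_hat_{t+1}, one entry per selected step.

    value_estimates holds the one-step value estimate of the state
    from which each selected denoising step was taken.
    """
    return [reward - v for v in value_estimates]


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO/GRPO-style term: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def selected_step_loss(ratios, advantages, kls, beta=0.01, eps=0.2):
    """Negative mean of (clipped surrogate - beta * KL) over the selected steps."""
    terms = [
        clipped_surrogate(r, a, eps) - beta * kl
        for r, a, kl in zip(ratios, advantages, kls)
    ]
    return -sum(terms) / len(terms)
```

For example, with per-step entropies `[0.1, 2.3, 0.7, 1.9]` and `k=2`, `select_steps` returns `[1, 3]`, so only those two denoising steps would contribute surrogate terms to the loss. In a real training loop the ratios, value estimates, and KL terms would come from the model; here they are plain floats so the arithmetic is easy to follow.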

Figure 1 : Overview of the performance on coding and reasoning tasks. Our approach outperforms the existing baselines in coding and logical reasoning tasks, while maintaining competitive performance in mathematical reasoning tasks.

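Experimental results

Research questions

RQ1: What is the correct MDP formulation for diffusion-based sequence generation?
RQ2: Can an exact, unbiased policy gradient for diffusion LMs be derived in a form that decomposes over denoising steps?
RQ3: How does the diffusion time structure enable stepwise credit assignment and allocation of compute?
RQ4: Do entropy-guided step selection and stepwise advantage estimation improve the efficiency and performance of RL fine-tuning for DLMs?
RQ5: How effective is the proposed method on coding and reasoning benchmarks compared with prior RL methods for diffusion LMs?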

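Key findings

EGSPO and EGSPO-SA outperform the base LLaDA-8B-Instruct model across reasoning tasks.
EGSPO-SA achieves the strongest overall performance on logical reasoning benchmarks such as Sudoku and Countdown.
On the coding benchmarks (MBPP, HumanEval), both methods outperform the baselines across generation lengths, with EGSPO-SA the strongest overall.
On mathematical reasoning tasks (GSM8K, MATH500), gains are modest and in line with prior diffusion RL methods.
EGSPO-SA is more compute-efficient than prior methods, requiring fewer FLOPs, samples, and gradient steps.
Ablation studies show that entropy-guided step selection outperforms uniform step selection and highlight the importance of stepwise credit assignment.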

This review was generated by AI and checked by a human editor.