QUICK REVIEW

[Paper Review] Difficulty-Estimated Policy Optimization

Yu Zhao, Fan Jiang | arXiv (Cornell University) | Feb 6, 2026

Reinforcement Learning in Robotics | Cited by: 0

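One-Sentence Summary

DEPO introduces an online difficulty estimator that filters out low-utility training samples before rollout, cutting rollout cost by up to 2x while maintaining or improving performance on mathematical reasoning benchmarks.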

ABSTRACT

Recent advancements in Large Reasoning Models (LRMs), exemplified by DeepSeek-R1, have underscored the potential of scaling inference-time compute through Group Relative Policy Optimization (GRPO). However, GRPO frequently suffers from gradient signal attenuation when encountering problems that are either too trivial or overly complex. In these scenarios, the disappearance of inter-group advantages makes the gradient signal susceptible to noise, thereby jeopardizing convergence stability. While variants like DAPO attempt to rectify gradient vanishing, they do not alleviate the substantial computational overhead incurred by exhaustive rollouts on low-utility samples. In this paper, we propose Difficulty-Estimated Policy Optimization (DEPO), a novel framework designed to optimize the efficiency and robustness of reasoning alignment. DEPO integrates an online Difficulty Estimator that dynamically assesses and filters training data before the rollout phase. This mechanism ensures that computational resources are prioritized for samples with high learning potential. Empirical results demonstrate that DEPO achieves up to a 2x reduction in rollout costs without compromising model performance. Our approach significantly lowers the computational barrier for training high-performance reasoning models, offering a more sustainable path for reasoning scaling. Code and data will be released upon acceptance.
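
The signal attenuation described in the abstract can be illustrated with a minimal sketch. It assumes binary verifier rewards and the commonly used group normalization (reward minus group mean, divided by group standard deviation). The function name, the `eps` constant, and the choice of population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (GRPO-style).

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Binary verifier rewards for four rollouts of a single prompt.
too_easy = group_relative_advantages([1.0, 1.0, 1.0, 1.0])     # every rollout correct
too_hard = group_relative_advantages([0.0, 0.0, 0.0, 0.0])     # every rollout wrong
informative = group_relative_advantages([1.0, 0.0, 0.0, 1.0])  # mixed outcomes

print(too_easy)     # all zeros: the prompt contributes no gradient signal
print(too_hard)     # all zeros again
print(informative)  # roughly +1 / -1: a usable learning signal
```

Under these assumptions, prompts whose rollouts all succeed or all fail yield zero advantages, so the rollouts spent on them are the low-utility computation that DEPO aims to skip.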

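Motivation and Objectives

Reduce rollout cost and gradient noise in GRPO-based RLVR (reinforcement learning with verifiable rewards).
Propose an online difficulty estimator that predicts per-sample advantage so that data can be filtered before rollout.
Show that filtering preserves the learning signal while improving efficiency and stability.
Demonstrate plug-and-play compatibility with existing RLVR frameworks and data-selection strategies.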

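Proposed Method

Integrate a lightweight BERT-based difficulty estimator into GRPO to predict the advantage expected from each prompt.
Train the estimator online, jointly with the actor, under a three-part objective: an advantage-estimation loss, a distillation loss (actor perplexity), and a pairwise ranking loss that enforces the correct ordering of difficulty.
Exclude zero-advantage samples before rollout to save computation, while updating the estimator with the advantages computed by GRPO.
Adopt two-stage training: an estimator warm-up phase without filtering, followed by online filtering.
Offer an online model-router function that uses the estimator with a confidence threshold to route queries to heterogeneous models.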
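
The two-stage schedule (warm-up without filtering, then online filtering before rollout) might look roughly like the sketch below. Everything named here is hypothetical: `split_batch`, `predict_utility`, the step counts, and the threshold are illustrative stand-ins, and the toy lookup table replaces the BERT-based estimator.

```python
def split_batch(prompts, predict_utility, step, warmup_steps, threshold):
    """Decide which prompts to roll out at a given training step.

    Assumption: predict_utility returns a non-negative score that is near
    zero for prompts expected to produce zero advantage.
    """
    if step < warmup_steps:
        # Stage 1: no filtering, so the estimator sees outcomes for every prompt.
        return list(prompts), []
    kept, skipped = [], []
    for prompt in prompts:
        (kept if predict_utility(prompt) > threshold else skipped).append(prompt)
    return kept, skipped

# Toy stand-in for the estimator: a fixed score per prompt.
toy_scores = {"2 + 2": 0.0, "geometry problem": 0.4, "open conjecture": 0.01}
batch = list(toy_scores)

warm = split_batch(batch, toy_scores.get, step=10, warmup_steps=50, threshold=0.05)
online = split_batch(batch, toy_scores.get, step=80, warmup_steps=50, threshold=0.05)

print(warm)    # all three prompts kept, none skipped
print(online)  # only "geometry problem" is rolled out
```

In a real training loop the advantages observed on the kept prompts would then be fed back to update the estimator. That step is omitted here because the review does not specify it.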

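Experimental Results

Research Questions

RQ1: Can online difficulty estimation reduce rollout cost in GRPO-based RLVR without hurting reasoning performance?
RQ2: How do the different training objectives (advantage estimation, distillation, and ranking) affect the quality and stability of the difficulty estimator?
RQ3: How does online filtering affect training efficiency and learning-signal quality across mathematical reasoning benchmarks?
RQ4: Can the estimator serve as an online router that balances accuracy and efficiency across models of different capacity?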
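
RQ2 concerns the estimator's three objectives. As a rough illustration of how such terms could be combined, the sketch below assumes mean-squared error for the advantage and distillation terms, a margin-based hinge for the pairwise ranking term, and unit weights. None of these functional forms, names, or weights are confirmed by the review.

```python
from itertools import combinations

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def pairwise_ranking_loss(pred, target, margin=0.1):
    """Hinge penalty when a pair is not ordered like the targets by at least `margin`."""
    losses = []
    for i, j in combinations(range(len(pred)), 2):
        if target[i] == target[j]:
            continue  # tied targets impose no ordering
        sign = 1.0 if target[i] > target[j] else -1.0
        losses.append(max(0.0, margin - sign * (pred[i] - pred[j])))
    return sum(losses) / len(losses) if losses else 0.0

def joint_loss(pred_adv, obs_adv, pred_ppl, actor_ppl, w_distill=1.0, w_rank=1.0):
    """Advantage-estimation + distillation + ranking terms (assumed weighting)."""
    return (mse(pred_adv, obs_adv)
            + w_distill * mse(pred_ppl, actor_ppl)
            + w_rank * pairwise_ranking_loss(pred_adv, obs_adv))

observed = [0.0, 0.5, 1.0]
perfect = joint_loss(observed, observed, [2.0, 3.0, 4.0], [2.0, 3.0, 4.0])
reversed_order = pairwise_ranking_loss([1.0, 0.5, 0.0], observed)

print(perfect)         # 0.0: values and ordering both match
print(reversed_order)  # positive: every pair is ranked the wrong way
```

The ranking term only looks at relative order, so under this formulation it can still penalize an estimator whose absolute predictions are close but whose ordering of prompts is wrong.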

Key Findings

Dataset	Model	Method	GSM8K	MATH	AMC23	Olympiad	Minerva	Avg.	GPU hours ↓
DAPO-MATH-17K	Qwen2.5-1.5B-Instruct	-	75.6	48.1	38.4	15.8	11.4	37.9	528 (1.0×)
DAPO-MATH-17K	Qwen2.5-1.5B-Instruct	DAPO	78.5	50.1	39.3	17.8	13.1	39.8	905 (1.7×)
DAPO-MATH-17K	Qwen2.5-1.5B-Instruct	Polaris	77.1	47.3	40.8	16.4	11.8	38.7	584 (1.1×)
DAPO-MATH-17K	Qwen2.5-1.5B-Instruct	DEPO	77.0	48.9	42.3	16.7	12.2	39.4	530 (1.0×)
DAPO-MATH-17K	Qwen2.5-1.5B-Instruct	– ranking loss	76.6	48.0	40.9	16.3	12.1	38.8	-
DAPO-MATH-17K	Qwen2.5-1.5B-Instruct	– distillation loss	75.2	48.0	39.0	15.9	12.0	38.0	-
DAPO-MATH-17K	Qwen2.5-1.5B-Instruct	+ DAPO w/o Dynamic Sampling	78.3	50.6	41.7	17.5	13.3	40.3	-
-	Qwen2.5-7B-Instruct	GRPO	91.9	64.1	63.4	27.9	25.0	54.5	776 (1.0×)
-	Qwen2.5-7B-Instruct	DEPO	92.3	63.9	63.5	28.7	25.5	54.8	782 (1.0×)
OR1	Qwen2.5-7B-Instruct	GRPO	92.0	63.3	48.9	26.4	26.2	51.4	-
OR1	Qwen2.5-7B-Instruct	DEPO	91.8	64.0	51.0	27.6	26.6	52.2	-
NT	Qwen2.5-7B-Instruct	GRPO	90.1	62.7	48.9	25.3	23.8	50.1	-
NT	Qwen2.5-7B-Instruct	DEPO	90.8	63.2	53.2	25.6	25.0	51.6	-
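
The relative-cost multipliers in the last column of the table can be reproduced with simple arithmetic. The snippet below is only a consistency check on the reported GPU-hour figures for the Qwen2.5-1.5B-Instruct rows. The dictionary keys are labels chosen for this sketch, and the 528-hour first row is treated as the reference because it is marked 1.0×.

```python
# GPU-hour figures copied from the 1.5B rows of the table.
gpu_hours = {"reference": 528, "DAPO": 905, "Polaris": 584, "DEPO": 530}

reference = gpu_hours["reference"]
relative = {name: round(hours / reference, 1) for name, hours in gpu_hours.items()}
print(relative)  # {'reference': 1.0, 'DAPO': 1.7, 'Polaris': 1.1, 'DEPO': 1.0}

# Cost of DAPO relative to DEPO in the same setting.
dapo_vs_depo = round(gpu_hours["DAPO"] / gpu_hours["DEPO"], 1)
print(dapo_vs_depo)  # 1.7
```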

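DEPO reduces rollout cost while matching or exceeding baseline accuracy across multiple mathematical reasoning benchmarks.
Compared with competing baselines such as DAPO, DEPO improves rollout efficiency by up to 2x while keeping total training time on par with GRPO (782 vs. 776 GPU hours on Qwen2.5-7B-Instruct).
Including both the ranking loss and the distillation loss is essential for robust difficulty discrimination and better downstream performance: in the 1.5B setting, dropping the ranking loss lowers the average score from 39.4 to 38.8, and dropping the distillation loss lowers it to 38.0.
The online difficulty estimator converges after a short warm-up and closely tracks the true reward, which enables effective pruning of low-information prompts.
DEPO is orthogonal and complementary to existing methods, so combining it with them is expected to yield further gains.
Used as an online router, DEPO achieves performance competitive with a large model while routing many queries to smaller, cheaper models.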
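
The routing use case in the last finding could be sketched as follows. This is a hypothetical illustration: it assumes the estimator exposes a difficulty score in [0, 1] and that a single threshold separates queries for a small model from those for a large one. The function name, the threshold value, and the toy scores are not from the paper.

```python
def route(query, predict_difficulty, threshold=0.5):
    """Send easy-looking queries to the small model and the rest to the large one."""
    return "small" if predict_difficulty(query) < threshold else "large"

# Toy stand-in for the estimator: fixed difficulty scores per query.
toy_difficulty = {
    "What is 7 * 8?": 0.05,
    "Solve x^2 - 5x + 6 = 0": 0.30,
    "Prove the inequality for all n": 0.90,
}

assignments = {q: route(q, toy_difficulty.get) for q in toy_difficulty}
small_share = sum(model == "small" for model in assignments.values()) / len(assignments)

print(assignments)
print(small_share)  # fraction of queries handled by the cheaper model
```

Raising the threshold sends more traffic to the small model at the risk of lower accuracy on harder queries, which is the accuracy and efficiency trade-off that RQ4 examines.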


This review was generated by AI and checked by a human editor.