QUICK REVIEW

[Paper Review] Difficulty-Estimated Policy Optimization

Yu Zhao, Fan Jiang|arXiv (Cornell University)|Feb 6, 2026

Reinforcement Learning in Robotics|Cited by 0

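One-Sentence Summary

DEPO introduces an online difficulty estimator that filters out low-utility training samples before the rollout phase, cutting rollout cost by up to about 2× while maintaining or improving performance on mathematical reasoning benchmarks.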

ABSTRACT

Recent advancements in Large Reasoning Models (LRMs), exemplified by DeepSeek-R1, have underscored the potential of scaling inference-time compute through Group Relative Policy Optimization (GRPO). However, GRPO frequently suffers from gradient signal attenuation when encountering problems that are either too trivial or overly complex. In these scenarios, the disappearance of inter-group advantages makes the gradient signal susceptible to noise, thereby jeopardizing convergence stability. While variants like DAPO attempt to rectify gradient vanishing, they do not alleviate the substantial computational overhead incurred by exhaustive rollouts on low-utility samples. In this paper, we propose Difficulty-Estimated Policy Optimization (DEPO), a novel framework designed to optimize the efficiency and robustness of reasoning alignment. DEPO integrates an online Difficulty Estimator that dynamically assesses and filters training data before the rollout phase. This mechanism ensures that computational resources are prioritized for samples with high learning potential. Empirical results demonstrate that DEPO achieves up to a 2x reduction in rollout costs without compromising model performance. Our approach significantly lowers the computational barrier for training high-performance reasoning models, offering a more sustainable path for reasoning scaling. Code and data will be released upon acceptance.

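Motivation and Objectives

Reduce the rollout cost and gradient noise of GRPO-based RLVR (reinforcement learning with verifiable rewards) so that large reasoning models (LRMs) train more efficiently.
Propose an online difficulty estimator that predicts per-sample advantage and filters data before rollout.
Show that filtering preserves the learning signal while improving efficiency and stability.
Demonstrate plug-and-play compatibility with existing RLVR frameworks and data-filtering strategies.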
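
To make the vanishing-signal problem concrete, here is a minimal sketch of a group-relative advantage computation in the style of GRPO. The function name, the use of the population standard deviation, and the `eps` constant are assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against the other rollouts of the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# A prompt the policy always solves, one it never solves, and one it solves half the time.
too_easy = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
too_hard = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

In this toy setting the first two groups produce all-zero advantages, so their rollouts cost compute without contributing a gradient, while the mixed group produces advantages near +1 and -1.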

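Proposed Method

Integrate a lightweight BERT-based difficulty estimator into GRPO to predict an estimated advantage for each prompt.
Train the estimator online alongside the actor with a three-part joint objective: an advantage estimation loss, a distillation loss (on the actor's probability perplexity), and a pairwise ranking loss that enforces the correct difficulty ordering.
Filter out zero-advantage samples before rollout to save compute, while updating the estimator with GRPO-derived advantages.
Use two-stage training: an estimator warm-up stage without filtering, followed by active online filtering.
Provide an online model-routing capability in which the estimator routes queries to heterogeneous models based on a confidence threshold.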
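
The steps above can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: the squared-error advantage loss, the hinge form of the ranking loss, the weights `w_distill` and `w_rank`, the `warmup_steps` and `threshold` values, and all function names are assumptions. The distillation term is passed in as a plain number because this review does not specify its form.

```python
def advantage_loss(pred, target):
    """Mean squared error between predicted and observed per-prompt advantage scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pairwise_ranking_loss(pred, target, margin=0.1):
    """Hinge penalty on pairs not separated by at least `margin` in the target order."""
    total, pairs = 0.0, 0
    for i in range(len(pred)):
        for j in range(len(pred)):
            if target[i] > target[j]:
                total += max(0.0, margin - (pred[i] - pred[j]))
                pairs += 1
    return total / pairs if pairs else 0.0


def estimator_loss(pred, target, distill_term=0.0, w_distill=1.0, w_rank=1.0):
    """Weighted sum of the three terms (weights are hypothetical)."""
    rank = pairwise_ranking_loss(pred, target)
    return advantage_loss(pred, target) + w_distill * distill_term + w_rank * rank


def select_prompts(prompts, predict_utility, step, warmup_steps=100, threshold=0.05):
    """Stage 1 (warm-up): keep every prompt. Stage 2: drop prompts predicted to be near zero-advantage."""
    if step < warmup_steps:
        return list(prompts)
    return [p for p in prompts if predict_utility(p) > threshold]


# Toy stand-in for the estimator: a lookup table of predicted scores.
toy_scores = {"2+2": 0.0, "olympiad geometry": 0.01, "word problem": 0.6}
during_warmup = select_prompts(toy_scores, toy_scores.get, step=10)
after_warmup = select_prompts(toy_scores, toy_scores.get, step=500)
```

During warm-up all three toy prompts receive rollouts; afterwards only the prompt with a non-negligible predicted score does, which is where the rollout savings would come from.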

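Experimental Results

Research Questions

RQ1: Can online difficulty estimation reduce the rollout cost of GRPO-based RLVR without degrading reasoning performance?
RQ2: How do the different training objectives (advantage estimation, distillation, ranking) affect the quality and stability of the difficulty estimator?
RQ3: How does online filtering affect training efficiency and learning-signal quality on mathematical reasoning benchmarks?
RQ4: Can the estimator serve as an online router that balances accuracy and efficiency across models of different capacity?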

Key Findings

Dataset	Model	Method	GSM8K	MATH	AMC23	Olympiad	Minerva	Avg.	GPU Hours ↓
DAPO-MATH-17K	Qwen2.5-1.5B-Instruct	-	75.6	48.1	38.4	15.8	11.4	37.9	528 (1.0×)
DAPO-MATH-17K	Qwen2.5-1.5B-Instruct	DAPO	78.5	50.1	39.3	17.8	13.1	39.8	905 (1.7×)
DAPO-MATH-17K	Qwen2.5-1.5B-Instruct	Polaris	77.1	47.3	40.8	16.4	11.8	38.7	584 (1.1×)
DAPO-MATH-17K	Qwen2.5-1.5B-Instruct	DEPO	77.0	48.9	42.3	16.7	12.2	39.4	530 (1.0×)
DAPO-MATH-17K	Qwen2.5-1.5B-Instruct	– ranking loss	76.6	48.0	40.9	16.3	12.1	38.8	-
DAPO-MATH-17K	Qwen2.5-1.5B-Instruct	– distill loss	75.2	48.0	39.0	15.9	12.0	38.0	-
DAPO-MATH-17K	Qwen2.5-1.5B-Instruct	+ DAPO w/o Dynamic Sampling	78.3	50.6	41.7	17.5	13.3	40.3	-
DAPO-MATH-17K	Qwen2.5-7B-Instruct	GRPO	91.9	64.1	63.4	27.9	25.0	54.5	776 (1.0×)
DAPO-MATH-17K	Qwen2.5-7B-Instruct	DEPO	92.3	63.9	63.5	28.7	25.5	54.8	782 (1.0×)
OR1	Qwen2.5-7B-Instruct	GRPO	92.0	63.3	48.9	26.4	26.2	51.4	-
OR1	Qwen2.5-7B-Instruct	DEPO	91.8	64.0	51.0	27.6	26.6	52.2	-
NT	Qwen2.5-7B-Instruct	GRPO	90.1	62.7	48.9	25.3	23.8	50.1	-
NT	Qwen2.5-7B-Instruct	DEPO	90.8	63.2	53.2	25.6	25.0	51.6	-
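
As a reading aid for the GPU Hours column, the multipliers in parentheses can be reproduced by dividing each run's GPU hours by the 528 hours of the first row. The sketch below does this for the Qwen2.5-1.5B-Instruct rows; the dictionary keys are labels chosen here for illustration, with `baseline` standing for the first row.

```python
# GPU hours copied from the Qwen2.5-1.5B-Instruct rows of the table.
gpu_hours = {"baseline": 528, "DAPO": 905, "Polaris": 584, "DEPO": 530}

# Cost of each run relative to the first row, rounded as in the table.
relative = {name: round(hours / gpu_hours["baseline"], 1) for name, hours in gpu_hours.items()}

# How much more total GPU time DAPO uses than DEPO.
dapo_vs_depo = round(gpu_hours["DAPO"] / gpu_hours["DEPO"], 2)
```

On these figures DAPO uses roughly 1.7 times the GPU hours of DEPO. The "up to about 2×" figure quoted in this review refers to rollout cost, which the table does not list separately.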

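DEPO matches or exceeds baseline accuracy across multiple mathematical reasoning benchmarks while reducing rollout cost.
Compared with competing baselines such as DAPO, DEPO improves rollout efficiency by up to about 2×, while its total training time stays close to GRPO's.
Combining the ranking loss with the distillation loss is essential for robust difficulty discrimination and for downstream performance gains.
The online difficulty estimator converges after a brief warm-up and agrees closely with the true rewards, which lets it prune low-information prompts effectively.
DEPO is orthogonal and complementary to existing methods, and combining them yields further gains.
Used as an online router, DEPO sends a large share of queries to smaller, cheaper models while remaining competitive with a larger model.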
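
The routing use case can be illustrated with a small sketch. It assumes the estimator exposes a per-query confidence in [0, 1] that the smaller model can solve the query; the function name `route_queries`, the `predict_solvable` callable, and the 0.7 threshold are hypothetical and not taken from the paper.

```python
def route_queries(queries, predict_solvable, threshold=0.7):
    """Send a query to the small model when the estimator is confident enough, else to the large one."""
    routes = {"small": [], "large": []}
    for query in queries:
        target = "small" if predict_solvable(query) >= threshold else "large"
        routes[target].append(query)
    return routes


# Toy stand-in for the estimator: a lookup table of confidences.
toy_confidence = {"add two numbers": 0.95, "prove an inequality": 0.20, "solve a linear equation": 0.80}
routes = route_queries(toy_confidence, toy_confidence.get)
```

Raising the threshold sends more queries to the larger model, trading cost for accuracy; lowering it does the opposite.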

This review was generated by AI and reviewed by a human editor.