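[Paper Explained] PRL: Process Reward Learning Improves LLMs' Reasoning Ability and Broadens the Reasoning Boundary
PRL decomposes entropy-regularized reinforcement learning into intermediate process rewards, providing dense supervision during reasoning. This improves average reasoning performance and broadens the reasoning boundary across multiple math reasoning benchmarks.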
Improving the reasoning abilities of Large Language Models (LLMs) has been an active research topic recently. However, most related work relies on outcome rewards at the trajectory level and therefore lacks fine-grained supervision during the reasoning process. Existing training frameworks that try to incorporate process signals into LLM optimization also rely heavily on tedious additional steps such as MCTS or training a separate reward model, which hurts training efficiency. Moreover, the intuition behind the design of these process signals lacks rigorous theoretical support, leaving the optimization mechanism opaque. In this paper, we propose Process Reward Learning (PRL), which decomposes the entropy-regularized reinforcement learning objective into intermediate steps, with rigorous process rewards that can be assigned to the model accordingly. Starting from theoretical motivation, we derive a formulation of PRL that is essentially equivalent to the objective of reward maximization plus a KL-divergence penalty term between the policy model and a reference model. However, PRL can turn the outcome reward into process supervision signals, which better guide exploration during RL optimization. Our experimental results demonstrate that PRL not only improves LLMs' average reasoning performance, measured by average@n, but also broadens the reasoning boundary by improving the pass@n metric. Extensive experiments show that the effectiveness of PRL can be verified and generalized.
Research Motivation and Goals
- Motivate the need for finer-grained supervision during multi-step reasoning beyond end-of-trajectory rewards.
- Introduce a theoretically grounded framework that converts outcome rewards into process rewards without relying on costly tools like MCTS or separate reward models.
- Demonstrate that process-guided RL improves both average reasoning performance and the reach of correct reasoning (pass@N) across model families.
- Provide an efficient, KL-regularized RL approach that integrates process supervision into standard policy-gradient training.
Proposed Method
- Derive PRL by decomposing the entropy-regularized RL objective into intermediate steps with process rewards that align with the global outcome.
- Define optimal process rewards $r^{*}$ that decompose as the outcome reward plus a log-ratio penalty between the policy and a reference model ($\pi^{*} \propto \pi_{0}\exp(\eta r^{*})$).
- Introduce a learnable reward model r_u and learnable policy pi_w, trained via policy-gradient style objectives with process-adaptive advantages.
- Propose a practical objective L(ω) that includes KL penalties to keep the policy close to the reference while leveraging process rewards and clipped importance-sampling for stability.
- Implement PRL in a standard RL setting without MCTS or separate reward models, using entropy-regularized objectives and optional GRPO-style advantage estimation.
- Provide algorithmic steps (Algorithm 1) detailing data sampling, trajectory generation, process and outcome reward computation, and policy updates.
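For reference, the KL-regularized objective that the bullets above build on has a well-known closed-form optimum. The block below writes out that standard trajectory-level result; it is not copied from the paper. The notation is an assumption: $\eta$ is treated as the inverse of the KL coefficient, $Z(x)$ is a normalizing constant introduced here, and the paper's own symbols and step-level decomposition may differ.

```latex
% Standard KL-regularized objective for a prompt x and response a
\max_{\pi}\;
  \mathbb{E}_{a \sim \pi(\cdot \mid x)}\big[ r(x,a) \big]
  \;-\; \frac{1}{\eta}\,
  \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{0}(\cdot \mid x) \big)

% Its maximizer (Z(x) is a normalizer, assumed notation)
\pi^{*}(a \mid x)
  = \frac{\pi_{0}(a \mid x)\,\exp\big(\eta\, r(x,a)\big)}{Z(x)},
\qquad
Z(x) = \sum_{a} \pi_{0}(a \mid x)\,\exp\big(\eta\, r(x,a)\big)

% Rearranged: the reward expressed as a log-ratio between the two policies
r(x,a)
  = \frac{1}{\eta}\,\log\frac{\pi^{*}(a \mid x)}{\pi_{0}(a \mid x)}
  \;+\; \frac{1}{\eta}\,\log Z(x)
```

The last line is what makes a log-ratio between the policy and the reference model a natural building block for rewards: because an autoregressive policy factorizes over tokens, the log-ratio of a full response splits into a sum over intermediate steps.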
![Figure 1: PRL workflow demonstration. For each prompt and response trajectory $(x,a)$ with $a=[a^{1},a^{2},\cdots,a^{L}]$ , we could split the reasoning response into several intermediate steps (by fixed length, newline symbol, etc.) and calculate the process reward as the entropy ratio between the](https://ar5iv.labs.arxiv.org/html/2601.10201/assets/x1.png)
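The caption above is truncated, so the paper's exact per-step reward formula cannot be recovered here. As a minimal sketch of the general idea, the Python below splits a response into fixed-length steps and assigns each step a reward equal to the negative log-ratio between policy and reference, scaled by `1 / eta`, with the outcome reward added at the final step. The function names, the fixed-length splitting rule, the sign convention, and the toy log-probabilities are all illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def split_into_steps(num_tokens: int, step_len: int) -> List[range]:
    """Split token positions 0..num_tokens-1 into consecutive fixed-length steps."""
    if step_len <= 0:
        raise ValueError("step_len must be positive")
    return [range(s, min(s + step_len, num_tokens))
            for s in range(0, num_tokens, step_len)]


def process_rewards(logp_policy: Sequence[float],
                    logp_ref: Sequence[float],
                    outcome_reward: float,
                    eta: float,
                    step_len: int) -> List[float]:
    """Assumed per-step reward: -(1/eta) * sum of token log-ratios in the step.

    The outcome reward is added to the last step, so the per-step rewards sum
    to: outcome_reward - (1/eta) * log(pi(a|x) / pi_0(a|x)).
    """
    if len(logp_policy) != len(logp_ref) or not logp_policy:
        raise ValueError("log-prob sequences must be non-empty and equal length")
    rewards = []
    for step in split_into_steps(len(logp_policy), step_len):
        # Log-ratio of this step = sum of per-token log pi - log pi_0.
        log_ratio = sum(logp_policy[t] - logp_ref[t] for t in step)
        rewards.append(-log_ratio / eta)
    rewards[-1] += outcome_reward
    return rewards


# Toy per-token log-probabilities (made-up numbers, for illustration only).
logp_policy = [-0.5, -1.0, -0.2, -0.7, -0.3]
logp_ref = [-0.6, -1.2, -0.2, -0.5, -0.4]
rewards = process_rewards(logp_policy, logp_ref,
                          outcome_reward=1.0, eta=2.0, step_len=2)
total_log_ratio = sum(p - q for p, q in zip(logp_policy, logp_ref))
```

Under these assumptions the step rewards sum to the outcome reward minus the scaled total log-ratio, which mirrors the claim that the process-level formulation is equivalent to reward maximization with a KL penalty at the trajectory level.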
Experimental Results
Research Questions
- RQ1: Does Process Reward Learning improve average reasoning performance (average@8) across diverse math-reasoning benchmarks and base models?
- RQ2: Does PRL broaden the reasoning boundary (pass@N), indicating better generalization to more challenging prompts?
- RQ3: Can process rewards be rigorously defined and used to guide exploration without expensive search or auxiliary reward models?
- RQ4: Is PRL efficient enough to train with standard policy-gradient pipelines, avoiding Monte Carlo Tree Search or separate reward-model training?
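The two metrics named in RQ1 and RQ2 can be sketched as follows, using their common definitions: average@n is the mean accuracy over n sampled responses per prompt, and pass@n is the fraction of prompts for which at least one of the n samples is correct. The function names and the toy correctness data are assumptions for illustration; the paper's evaluation code may differ (for example, by using an unbiased pass@k estimator).

```python
from typing import List, Sequence


def average_at_n(results: Sequence[Sequence[bool]]) -> float:
    """Mean per-prompt accuracy over the n sampled responses."""
    if not results or any(len(r) == 0 for r in results):
        raise ValueError("need at least one prompt and one sample per prompt")
    return sum(sum(r) / len(r) for r in results) / len(results)


def pass_at_n(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of prompts solved by at least one of the n samples."""
    if not results:
        raise ValueError("need at least one prompt")
    return sum(1 for r in results if any(r)) / len(results)


# Toy data: 3 prompts, 4 sampled responses each (True = correct answer).
results: List[List[bool]] = [
    [True, False, False, True],    # solved in half of the samples
    [False, False, False, False],  # never solved
    [True, True, True, True],      # always solved
]
avg = average_at_n(results)
boundary = pass_at_n(results)
```

The two numbers answer different questions: average@n rewards being right consistently, while pass@n only asks whether a correct answer is reachable at all, which is why it is read as a measure of the reasoning boundary.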
Key Findings
| MATH500 | Minerva Math | Olympiad Bench | AMC23 | AIME24 | Avg |
|---|---|---|---|---|---|
| 81.60 | 35.66 | 48.15 | 65.00 | 20.00 | 56.82 |
| 87.40 | 42.65 | 52.00 | 77.50 | 33.33 | 62.23 |
| 88.00 | 44.85 | 56.44 | 70.00 | 20.00 | 64.40 |
| 89.40 | 45.59 | 58.07 | 85.00 | 30.00 | 66.31 |
| 82.00 | 31.03 | 51.56 | 72.50 | 40.00 | 58.24 |
| 91.80 | 51.84 | 62.22 | 82.50 | 46.67 | 70.34 |
| 92.60 | 52.21 | 65.33 | 85.00 | 46.67 | 72.12 |
| 93.60 | 52.57 | 65.19 | 85.00 | 43.33 | 72.38 |
| 45.20 | 8.46 | 12.89 | 20.00 | 6.67 | 22.81 |
| 57.80 | 17.65 | 22.81 | 30.00 | 13.33 | 33.42 |
| 60.60 | 17.65 | 20.15 | 30.00 | 6.67 | 33.03 |
| 67.80 | 28.31 | 28.30 | 45.00 | 23.33 | 41.66 |
| 76.80 | 36.03 | 39.56 | 55.00 | 16.67 | 51.16 |
| 74.00 | 36.40 | 41.33 | 67.50 | 16.67 | 51.42 |
- PRL yields consistently better performance than baselines (RAFT, GRPO, etc.) across multiple base models and math benchmarks.
- PRL improves average@8 scores, indicating stronger overall reasoning accuracy.
- PRL broadens the reasoning boundary as shown by improvements in pass@8 metrics across evaluated models.
- In the results table above, PRL achieves the highest Avg scores among the listed configurations for several base models (e.g., Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct).
- PRL demonstrates significant gains in both average performance and boundary metrics relative to outcome-only rewards, validating the effectiveness of process supervision.

This explainer was generated by AI and reviewed by a human editor.