[Paper Walkthrough] RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization
RePO introduces a two-stage Rephrasing Policy Optimization framework that digests off-policy knowledge and rephrases it into the model’s on-policy style, replacing low-quality rollouts with rephrased high-quality trajectories to improve hard-sample learning while preserving on-policy stability.
Aligning large language models (LLMs) on domain-specific data remains a fundamental challenge. Supervised fine-tuning (SFT) offers a straightforward way to inject domain knowledge but often degrades the model's generality. In contrast, on-policy reinforcement learning (RL) preserves generality but fails to effectively assimilate hard samples that exceed the model's current reasoning level. Recent off-policy RL attempts improve hard sample utilization, yet they suffer from severe training instability due to the forced distribution shift toward off-policy knowledge. To reconcile effective off-policy knowledge absorption with the stability of on-policy RL, we propose Rephrasing Policy Optimization (RePO). In RePO, the policy model is prompted to first comprehend off-policy knowledge and then rephrase it into trajectories that conform to its own stylistic and parametric distribution. RePO dynamically replaces low-reward rollouts with these rephrased, high-quality trajectories. This strategy guides the model toward correct reasoning paths while strictly preserving on-policy training dynamics. Experiments on several benchmarks demonstrate that RePO improves hard-sample utilization and outperforms existing baselines, achieving state-of-the-art performance.
Research Motivation and Goals
- Motivate the challenge of injecting domain-specific knowledge into LLMs without sacrificing general reasoning.
- Address instability and inefficiency of combining on-policy RL with off-policy data for hard samples.
- Propose a principled mechanism to assimilate off-policy guidance while preserving the model’s distribution.
Proposed Method
- Introduce RePO with a two-phase knowledge assimilation: (1) Knowledge Internalization where an off-policy trajectory is rephrased into the model’s native style via a rephrasing prompt; (2) Dynamic Guidance where a rephrased trajectory replaces a low-reward on-policy rollout when a group failure rate threshold is exceeded.
- Use Joint Probability Trajectory Sampling conditioned on off-policy knowledge to generate o_rep from a prompt P(q,k).
- Apply a Dynamic Guidance Strategy based on Group Reward Distribution with hyperparameters delta (reward threshold) and rho (minimum failure rate) to decide when to substitute o_rep for the worst on-policy rollout.
- Optimize the final rollout group with the GRPO objective, ensuring updates remain aligned with the model’s distribution.
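The substitution step and the group-normalized advantage that GRPO computes over the final group can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Rollout` type, the function names, and the prompt template are hypothetical, a rollout is assumed to "fail" when its reward is strictly below delta, substitution is assumed to trigger when the failure rate is at least rho, and the advantage uses the population standard deviation. The clipped surrogate loss and any KL term of the full GRPO objective are omitted.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Rollout:
    text: str
    reward: float
    rephrased: bool = False  # True for o_rep, False for ordinary on-policy samples


def build_rephrase_prompt(question: str, knowledge: str) -> str:
    """Hypothetical stand-in for P(q, k); the paper's real template is not reproduced here."""
    return (
        f"Question:\n{question}\n\n"
        f"Reference solution:\n{knowledge}\n\n"
        "Study the reference, then solve the question again in your own words."
    )


def apply_dynamic_guidance(group: List[Rollout], o_rep: Rollout,
                           delta: float, rho: float) -> List[Rollout]:
    """Swap o_rep in for the worst rollout when the group's failure rate reaches rho."""
    # Assumption: a rollout "fails" when its reward is strictly below delta.
    failure_rate = sum(r.reward < delta for r in group) / len(group)
    if failure_rate < rho:
        return list(group)  # enough successes: keep the group fully on-policy
    worst = min(range(len(group)), key=lambda i: group[i].reward)
    guided = list(group)
    guided[worst] = o_rep
    return guided


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: three of four rollouts fail, so the failure rate is 0.75.
group = [Rollout("a", 0.0), Rollout("b", 0.0), Rollout("c", 0.0), Rollout("d", 1.0)]
o_rep = Rollout("rephrased", 1.0, rephrased=True)
guided = apply_dynamic_guidance(group, o_rep, delta=0.5, rho=0.5)
advantages = grpo_advantages([r.reward for r in guided])
```

In this toy group the failure rate (0.75) is above rho (0.5), so the first zero-reward rollout is replaced by the rephrased trajectory, which then receives a positive advantage alongside the one successful on-policy sample. When most rollouts already succeed, the group is returned unchanged and training stays purely on-policy.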
Experimental Results
Research Questions
- RQ1: Can RePO effectively leverage off-policy knowledge without destabilizing on-policy learning?
- RQ2: Does rephrasing off-policy guidance into the model’s native vocabulary improve learning from hard samples?
- RQ3: How does RePO compare to GRPO and LUFFY in stability and performance across math and knowledge benchmarks?
Key Findings
| Method | GPQA | AIME24 | AIME25 | AMC | MATH-500 | Minerva | Olympiad |
|---|---|---|---|---|---|---|---|
| Qwen3-8B | 58.1 | 75.1 | 66.4 | 88.9 | 96.2 | 51.1 | 69.2 |
| GRPO | 59.2 | 75.1 | 65.8 | 89.3 | 94.8 | 65.4 | 69.8 |
| LUFFY | 49.8 | 75.5 | 64.1 | 87.9 | 94.0 | 66.5 | 68.7 |
| RePO (Ours) | 61.8 | 75.8 | 72.5 | 88.6 | 94.8 | 68.1 | 68.1 |
- RePO posts the best score on four of the seven benchmarks in the table (GPQA, AIME24, AIME25, Minerva), while GRPO leads on AMC and Olympiad and the Qwen3-8B row leads on MATH-500; the paper describes these results as state-of-the-art.
- RePO improves hard-sample utilization relative to GRPO, most visibly on GPQA (61.8 vs. 59.2) and AIME25 (72.5 vs. 65.8); the AIME24 gain is marginal (75.8 vs. 75.1).
- RePO remains robust and stable, whereas LUFFY is unstable on GPQA (49.8, below the 58.1 in the Qwen3-8B row) and can suffer from vocabulary mismatch.
- On financial-domain benchmarks, RePO delivers strong knowledge injection while preserving general reasoning abilities.
- Training stability analyses indicate RePO achieves consistent entropy, GradNorm, and rewards, reflecting stable updates.
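To check the comparisons above against the table, the per-benchmark leader and the RePO-minus-GRPO gap can be computed directly. The sketch below only re-uses the numbers from the table; the helper names are illustrative and not from the paper.

```python
from typing import Dict, List

BENCHMARKS = ["GPQA", "AIME24", "AIME25", "AMC", "MATH-500", "Minerva", "Olympiad"]

# Scores copied from the results table, in BENCHMARKS order.
SCORES: Dict[str, List[float]] = {
    "Qwen3-8B":    [58.1, 75.1, 66.4, 88.9, 96.2, 51.1, 69.2],
    "GRPO":        [59.2, 75.1, 65.8, 89.3, 94.8, 65.4, 69.8],
    "LUFFY":       [49.8, 75.5, 64.1, 87.9, 94.0, 66.5, 68.7],
    "RePO (Ours)": [61.8, 75.8, 72.5, 88.6, 94.8, 68.1, 68.1],
}


def leaders(scores: Dict[str, List[float]]) -> Dict[str, str]:
    """Return the top-scoring method for each benchmark."""
    return {
        bench: max(scores, key=lambda method: scores[method][i])
        for i, bench in enumerate(BENCHMARKS)
    }


def gap(scores: Dict[str, List[float]], a: str, b: str) -> Dict[str, float]:
    """Return score(a) - score(b) per benchmark, rounded to one decimal place."""
    return {
        bench: round(scores[a][i] - scores[b][i], 1)
        for i, bench in enumerate(BENCHMARKS)
    }


best = leaders(SCORES)
repo_vs_grpo = gap(SCORES, "RePO (Ours)", "GRPO")
repo_wins = [bench for bench, method in best.items() if method == "RePO (Ours)"]
```

Here `repo_wins` lists the benchmarks where RePO has the top score, and `repo_vs_grpo` makes it easy to see where the gap to GRPO is large (AIME25), small (AIME24), or negative (AMC, Olympiad).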
This walkthrough was generated by AI and reviewed by a human editor.