[Paper Explained] RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization
RePO introduces a two-stage Rephrasing Policy Optimization framework that digests off-policy knowledge and rephrases it into the model’s on-policy style, replacing low-quality rollouts with rephrased high-quality trajectories to improve hard-sample learning while preserving on-policy stability.
Aligning large language models (LLMs) on domain-specific data remains a fundamental challenge. Supervised fine-tuning (SFT) offers a straightforward way to inject domain knowledge but often degrades the model's generality. In contrast, on-policy reinforcement learning (RL) preserves generality but fails to effectively assimilate hard samples that exceed the model's current reasoning level. Recent off-policy RL attempts improve hard sample utilization, yet they suffer from severe training instability due to the forced distribution shift toward off-policy knowledge. To reconcile effective off-policy knowledge absorption with the stability of on-policy RL, we propose Rephrasing Policy Optimization (RePO). In RePO, the policy model is prompted to first comprehend off-policy knowledge and then rephrase it into trajectories that conform to its own stylistic and parametric distribution. RePO dynamically replaces low-reward rollouts with these rephrased, high-quality trajectories. This strategy guides the model toward correct reasoning paths while strictly preserving on-policy training dynamics. Experiments on several benchmarks demonstrate that RePO improves hard-sample utilization and outperforms existing baselines, achieving state-of-the-art performance.
Research Motivation and Objectives
- Inject domain-specific knowledge into LLMs without sacrificing general reasoning ability.
- Address the instability and inefficiency that arise when on-policy RL is combined with off-policy data on hard samples.
- Propose a principled mechanism to assimilate off-policy guidance while preserving the model’s distribution.
Proposed Method
- Introduce RePO with a two-phase knowledge assimilation: (1) Knowledge Internalization where an off-policy trajectory is rephrased into the model’s native style via a rephrasing prompt; (2) Dynamic Guidance where a rephrased trajectory replaces a low-reward on-policy rollout when a group failure rate threshold is exceeded.
- Use Joint Probability Trajectory Sampling conditioned on off-policy knowledge to generate o_rep from a prompt P(q,k).
- Apply a Dynamic Guidance Strategy based on Group Reward Distribution with hyperparameters delta (reward threshold) and rho (minimum failure rate) to decide when to substitute o_rep for the worst on-policy rollout.
- Optimize the final rollout group with the GRPO objective, ensuring updates remain aligned with the model’s distribution.
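The substitution rule and the group-relative advantage step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that a rollout counts as a failure when its reward is below delta, that substitution triggers when the failure rate is at least rho, and that the rephrased trajectory arrives with its own reward `r_rep`. The function names and the `eps` constant are hypothetical. The advantage step uses the standard GRPO normalization (reward minus group mean, divided by group standard deviation).

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def apply_dynamic_guidance(
    rollouts: Sequence[str],
    rewards: Sequence[float],
    o_rep: str,
    r_rep: float,
    delta: float,
    rho: float,
) -> Tuple[List[str], List[float], bool]:
    """Swap the lowest-reward rollout for o_rep when the group fails too often.

    Assumption: "failure" means reward < delta, and the swap triggers when
    the fraction of failures in the group is >= rho.
    """
    if not rollouts or len(rollouts) != len(rewards):
        raise ValueError("rollouts and rewards must be non-empty and aligned")

    failure_rate = sum(r < delta for r in rewards) / len(rewards)
    new_rollouts, new_rewards = list(rollouts), list(rewards)
    if failure_rate < rho:
        # Group is doing well enough: keep the pure on-policy rollouts.
        return new_rollouts, new_rewards, False

    # Index of the worst on-policy rollout (first one on ties).
    worst = min(range(len(rewards)), key=rewards.__getitem__)
    new_rollouts[worst] = o_rep
    new_rewards[worst] = r_rep
    return new_rollouts, new_rewards, True


def group_normalized_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """GRPO-style advantages: (r_i - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    group = ["rollout_0", "rollout_1", "rollout_2", "rollout_3"]
    scores = [0.0, 0.0, 0.0, 0.0]  # a hard sample: every rollout fails
    print(group_normalized_advantages(scores))  # no learning signal
    group, scores, swapped = apply_dynamic_guidance(
        group, scores, "rephrased_trajectory", 1.0, delta=0.5, rho=0.5
    )
    print(swapped, group_normalized_advantages(scores))
```

In this sketch, a group in which every rollout fails yields all-zero advantages, which is one way to see why hard samples contribute little under plain group-relative training. After a single substitution the group contains a positive example and the advantages become non-zero.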
Experimental Results
Research Questions
- RQ1: Can RePO effectively leverage off-policy knowledge without destabilizing on-policy learning?
- RQ2: Does rephrasing off-policy guidance into the model’s native vocabulary improve learning from hard samples?
- RQ3: How does RePO compare to GRPO and LUFFY in stability and performance across math and knowledge benchmarks?
Key Findings
- RePO outperforms standard on-policy RL baselines and existing off-policy methods on several benchmarks, achieving state-of-the-art results.
- RePO significantly improves hard-sample utilization on GPQA and AIME datasets compared with GRPO.
- RePO remains robust and stable, unlike LUFFY, which shows instability on GPQA and can suffer from vocabulary mismatch.
- On financial-domain benchmarks, RePO delivers strong knowledge injection while preserving general reasoning abilities.
- Training stability analyses indicate that entropy, gradient norm (GradNorm), and reward remain consistent under RePO, reflecting stable updates.
This explainer was generated by AI and reviewed by a human editor.