QUICK REVIEW

[Paper Review] RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization

Linxuan Xia, Xiaolong Yang|arXiv (Cornell University)|Feb 11, 2026
Topic Modeling|Cited by 0
ONE-SENTENCE SUMMARY

RePO introduces a two-stage Rephrasing Policy Optimization framework that digests off-policy knowledge and rephrases it into the model’s on-policy style, replacing low-quality rollouts with rephrased high-quality trajectories to improve hard-sample learning while preserving on-policy stability.

ABSTRACT

Aligning large language models (LLMs) on domain-specific data remains a fundamental challenge. Supervised fine-tuning (SFT) offers a straightforward way to inject domain knowledge but often degrades the model's generality. In contrast, on-policy reinforcement learning (RL) preserves generality but fails to effectively assimilate hard samples that exceed the model's current reasoning level. Recent off-policy RL attempts improve hard sample utilization, yet they suffer from severe training instability due to the forced distribution shift toward off-policy knowledge. To reconcile effective off-policy knowledge absorption with the stability of on-policy RL, we propose Rephrasing Policy Optimization (RePO). In RePO, the policy model is prompted to first comprehend off-policy knowledge and then rephrase it into trajectories that conform to its own stylistic and parametric distribution. RePO dynamically replaces low-reward rollouts with these rephrased, high-quality trajectories. This strategy guides the model toward correct reasoning paths while strictly preserving on-policy training dynamics. Experiments on several benchmarks demonstrate that RePO improves hard-sample utilization and outperforms existing baselines, achieving state-of-the-art performance.

MOTIVATION AND OBJECTIVES

  • Motivate the challenge of injecting domain-specific knowledge into LLMs without sacrificing general reasoning.
  • Address instability and inefficiency of combining on-policy RL with off-policy data for hard samples.
  • Propose a principled mechanism to assimilate off-policy guidance while preserving the model’s distribution.

PROPOSED METHOD

  • Introduce RePO with a two-phase knowledge assimilation: (1) Knowledge Internalization where an off-policy trajectory is rephrased into the model’s native style via a rephrasing prompt; (2) Dynamic Guidance where a rephrased trajectory replaces a low-reward on-policy rollout when a group failure rate threshold is exceeded.
  • Use Joint Probability Trajectory Sampling conditioned on off-policy knowledge to generate the rephrased trajectory o_rep from a prompt P(q, k).
  • Apply a Dynamic Guidance Strategy based on Group Reward Distribution with hyperparameters delta (reward threshold) and rho (minimum failure rate) to decide when to substitute o_rep for the worst on-policy rollout.
  • Optimize the final rollout group with the GRPO objective, ensuring updates remain aligned with the model’s distribution.
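
The substitution rule and the group-level update described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Rollout`, `apply_dynamic_guidance`, and `group_advantages` are hypothetical; a rollout is assumed to count as a failure when its reward is below delta; the swap is assumed to trigger once the failure rate reaches rho; and o_rep is assumed to have been scored by the same reward function as the on-policy rollouts. Only the group-normalized advantage used by GRPO is shown; the clipped policy ratio and KL term of the full objective are omitted.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One sampled trajectory and its scalar reward (hypothetical container)."""
    text: str
    reward: float
    rephrased: bool = False


def apply_dynamic_guidance(group, o_rep, delta, rho):
    """Return a copy of the group, swapping o_rep in for the worst rollout
    when the fraction of rollouts with reward < delta is at least rho.

    The strict '<' and the '>=' trigger are assumptions of this sketch.
    """
    failures = sum(1 for r in group if r.reward < delta)
    failure_rate = failures / len(group)
    guided = list(group)
    if failure_rate >= rho:
        # Index of the lowest-reward rollout (first one on ties).
        worst = min(range(len(group)), key=lambda i: group[i].reward)
        guided[worst] = o_rep
    return guided


def group_advantages(group, eps=1e-6):
    """Group-normalized advantages: (reward - group mean) / (group std + eps)."""
    rewards = [r.reward for r in group]
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]


# Usage: three of four rollouts fail, so the rephrased trajectory is swapped in.
group = [Rollout("a", 0.0), Rollout("b", 0.0), Rollout("c", 0.0), Rollout("d", 1.0)]
o_rep = Rollout("rephrased", 1.0, rephrased=True)
guided = apply_dynamic_guidance(group, o_rep, delta=0.5, rho=0.5)
advantages = group_advantages(guided)
```

In this toy group the failure rate is 0.75, so the first zero-reward rollout is replaced and the rephrased trajectory receives a positive advantage. A group that mostly succeeds is returned unchanged, which keeps easy prompts purely on-policy.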

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

  • RQ1: Can RePO effectively leverage off-policy knowledge without destabilizing on-policy learning?
  • RQ2: Does rephrasing off-policy guidance into the model’s native vocabulary improve learning from hard samples?
  • RQ3: How does RePO compare to GRPO and LUFFY in stability and performance across math and knowledge benchmarks?

KEY FINDINGS

Method        GPQA   AIME24   AIME25   AMC    MATH-500   Minerva   Olympiad
Qwen3-8B      58.1   75.1     66.4     88.9   96.2       51.1      69.2
GRPO          59.2   75.1     65.8     89.3   94.8       65.4      69.8
LUFFY         49.8   75.5     64.1     87.9   94.0       66.5      68.7
RePO (Ours)   61.8   75.8     72.5     88.6   94.8       68.1      68.1

  • RePO outperforms the on-policy baseline (GRPO) and the off-policy baseline (LUFFY) on several benchmarks, posting the best listed scores on GPQA (61.8), AIME24 (75.8), AIME25 (72.5), and Minerva (68.1); it does not lead on AMC, MATH-500, or Olympiad.
  • RePO improves hard-sample utilization relative to GRPO, with gains on GPQA (59.2 to 61.8) and AIME25 (65.8 to 72.5) and a smaller gain on AIME24 (75.1 to 75.8).
  • RePO stays robust where LUFFY does not: LUFFY scores 49.8 on GPQA, below the Qwen3-8B score of 58.1, and can suffer from vocabulary mismatch.
  • On financial-domain benchmarks, RePO delivers strong knowledge injection while preserving general reasoning abilities.
  • Training stability analyses indicate RePO achieves consistent entropy, GradNorm, and rewards, reflecting stable updates.
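
To make the comparisons in the results table easy to check, the short script below recomputes per-benchmark gaps and the leading method on each benchmark. It is a convenience sketch only: the scores are copied from the table in this review, and the helper names `score_gap` and `best_methods` are hypothetical.

```python
BENCHMARKS = ["GPQA", "AIME24", "AIME25", "AMC", "MATH-500", "Minerva", "Olympiad"]

# Scores transcribed from the results table in this review.
SCORES = {
    "Qwen3-8B": [58.1, 75.1, 66.4, 88.9, 96.2, 51.1, 69.2],
    "GRPO": [59.2, 75.1, 65.8, 89.3, 94.8, 65.4, 69.8],
    "LUFFY": [49.8, 75.5, 64.1, 87.9, 94.0, 66.5, 68.7],
    "RePO": [61.8, 75.8, 72.5, 88.6, 94.8, 68.1, 68.1],
}


def score_gap(method, baseline):
    """Per-benchmark difference (method minus baseline), rounded to one decimal."""
    return {
        name: round(m - b, 1)
        for name, m, b in zip(BENCHMARKS, SCORES[method], SCORES[baseline])
    }


def best_methods():
    """Map each benchmark to the list of methods sharing the top score."""
    best = {}
    for i, name in enumerate(BENCHMARKS):
        top = max(scores[i] for scores in SCORES.values())
        best[name] = [m for m, scores in SCORES.items() if scores[i] == top]
    return best


if __name__ == "__main__":
    print(score_gap("RePO", "GRPO"))
    print(best_methods())
```

Run against these numbers, RePO's largest gap over GRPO is on AIME25 (6.7 points), and RePO holds the top score on four of the seven benchmarks.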

This review was generated by AI and reviewed by human editors.