QUICK REVIEW

[Paper Review] RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization

Linxuan Xia, Xiaolong Yang | arXiv (Cornell University) | Feb 11, 2026

Topic Modeling | Cited by 0

ONE-SENTENCE SUMMARY

RePO introduces a two-stage Rephrasing Policy Optimization framework that digests off-policy knowledge and rephrases it into the model’s on-policy style, replacing low-quality rollouts with rephrased high-quality trajectories to improve hard-sample learning while preserving on-policy stability.

ABSTRACT

Aligning large language models (LLMs) on domain-specific data remains a fundamental challenge. Supervised fine-tuning (SFT) offers a straightforward way to inject domain knowledge but often degrades the model's generality. In contrast, on-policy reinforcement learning (RL) preserves generality but fails to effectively assimilate hard samples that exceed the model's current reasoning level. Recent off-policy RL attempts improve hard sample utilization, yet they suffer from severe training instability due to the forced distribution shift toward off-policy knowledge. To reconcile effective off-policy knowledge absorption with the stability of on-policy RL, we propose Rephrasing Policy Optimization (RePO). In RePO, the policy model is prompted to first comprehend off-policy knowledge and then rephrase it into trajectories that conform to its own stylistic and parametric distribution. RePO dynamically replaces low-reward rollouts with these rephrased, high-quality trajectories. This strategy guides the model toward correct reasoning paths while strictly preserving on-policy training dynamics. Experiments on several benchmarks demonstrate that RePO improves hard-sample utilization and outperforms existing baselines, achieving state-of-the-art performance.

RESEARCH MOTIVATION AND OBJECTIVES

Inject domain-specific knowledge into LLMs without sacrificing general reasoning ability.
Address the instability and inefficiency of combining on-policy RL with off-policy data on hard samples.
Propose a principled mechanism that assimilates off-policy guidance while preserving the model’s own distribution.

PROPOSED METHOD

Introduce RePO with a two-phase knowledge assimilation: (1) Knowledge Internalization where an off-policy trajectory is rephrased into the model’s native style via a rephrasing prompt; (2) Dynamic Guidance where a rephrased trajectory replaces a low-reward on-policy rollout when a group failure rate threshold is exceeded.
Use Joint Probability Trajectory Sampling conditioned on off-policy knowledge to generate o_rep from a prompt P(q,k).
Apply a Dynamic Guidance Strategy based on Group Reward Distribution with hyperparameters delta (reward threshold) and rho (minimum failure rate) to decide when to substitute o_rep for the worst on-policy rollout.
Optimize the final rollout group with the GRPO objective, ensuring updates remain aligned with the model’s distribution.
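
The substitution rule and the group-level normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Rollout` container and both function names are hypothetical, a rollout is assumed to "fail" when its reward is below delta, substitution is assumed to trigger when the failure rate is at least rho, and the advantage uses the commonly cited GRPO normalization with a population standard deviation.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass(frozen=True)
class Rollout:
    """One trajectory in a rollout group (hypothetical container)."""
    text: str
    reward: float
    rephrased: bool = False


def apply_dynamic_guidance(group: List[Rollout], o_rep: Rollout,
                           delta: float, rho: float) -> List[Rollout]:
    """Swap the worst rollout for o_rep when the group fails often enough.

    Assumptions: a rollout fails when reward < delta, and substitution
    triggers when the group failure rate is >= rho.
    """
    if not group:
        raise ValueError("rollout group must not be empty")
    failure_rate = sum(r.reward < delta for r in group) / len(group)
    if failure_rate < rho:
        return list(group)  # enough successes: keep the group purely on-policy
    # Index of the lowest-reward rollout (first one on ties).
    worst = min(range(len(group)), key=lambda i: group[i].reward)
    guided = list(group)  # copy so the caller's group is not mutated
    guided[worst] = o_rep
    return guided


def group_advantages(group: List[Rollout], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    rewards = [r.reward for r in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    group = [Rollout("a", 0.0), Rollout("b", 0.0),
             Rollout("c", 0.0), Rollout("d", 1.0)]
    o_rep = Rollout("rephrased", 1.0, rephrased=True)
    guided = apply_dynamic_guidance(group, o_rep, delta=0.5, rho=0.75)
    print([r.reward for r in guided])
    print(group_advantages(guided))
```

In this sketch a group that already contains enough successful rollouts is returned unchanged, so the rephrased trajectory only enters training on prompts the policy mostly fails.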

EXPERIMENTAL RESULTS

Research Questions

RQ1: Can RePO effectively leverage off-policy knowledge without destabilizing on-policy learning?
RQ2: Does rephrasing off-policy guidance into the model’s native vocabulary improve learning from hard samples?
RQ3: How does RePO compare to GRPO and LUFFY in stability and performance across math and knowledge benchmarks?

Key Findings

Method	GPQA	AIME24	AIME25	AMC	MATH-500	Minerva	Olympiad
Qwen3-8B	58.1	75.1	66.4	88.9	96.2	51.1	69.2
GRPO	59.2	75.1	65.8	89.3	94.8	65.4	69.8
LUFFY	49.8	75.5	64.1	87.9	94.0	66.5	68.7
RePO (Ours)	61.8	75.8	72.5	88.6	94.8	68.1	68.1

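RePO outperforms the on-policy baseline (GRPO) and the off-policy baseline (LUFFY) on several benchmarks: in the table above it has the top score on GPQA, AIME24, AIME25, and Minerva, while GRPO leads on AMC and Olympiad and the base Qwen3-8B leads on MATH-500.
Compared with GRPO, RePO improves hard-sample utilization on GPQA (61.8 vs. 59.2) and AIME25 (72.5 vs. 65.8), with a smaller gain on AIME24 (75.8 vs. 75.1).
RePO remains robust and stable where LUFFY does not: LUFFY scores 49.8 on GPQA, below the 58.1 of the base Qwen3-8B, and can also suffer from vocabulary mismatch.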
On financial-domain benchmarks, RePO delivers strong knowledge injection while preserving general reasoning abilities.
Training stability analyses indicate RePO achieves consistent entropy, GradNorm, and rewards, reflecting stable updates.
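
As a quick check on the findings above, the following sketch recomputes which method leads each benchmark and how RePO differs from GRPO, using only the scores in the table. The helper names `best_per_benchmark` and `deltas` are illustrative and do not come from the paper.

```python
from typing import Dict, List

BENCHMARKS = ["GPQA", "AIME24", "AIME25", "AMC", "MATH-500", "Minerva", "Olympiad"]

# Scores copied from the results table, in BENCHMARKS order.
SCORES: Dict[str, List[float]] = {
    "Qwen3-8B":    [58.1, 75.1, 66.4, 88.9, 96.2, 51.1, 69.2],
    "GRPO":        [59.2, 75.1, 65.8, 89.3, 94.8, 65.4, 69.8],
    "LUFFY":       [49.8, 75.5, 64.1, 87.9, 94.0, 66.5, 68.7],
    "RePO (Ours)": [61.8, 75.8, 72.5, 88.6, 94.8, 68.1, 68.1],
}


def best_per_benchmark(scores: Dict[str, List[float]],
                       benchmarks: List[str]) -> Dict[str, List[str]]:
    """Return the method(s) with the top score on each benchmark."""
    best = {}
    for i, name in enumerate(benchmarks):
        top = max(row[i] for row in scores.values())
        best[name] = [m for m, row in scores.items() if row[i] == top]
    return best


def deltas(scores: Dict[str, List[float]], benchmarks: List[str],
           method: str, baseline: str) -> Dict[str, float]:
    """Per-benchmark score difference (method minus baseline), 1 decimal."""
    return {
        name: round(scores[method][i] - scores[baseline][i], 1)
        for i, name in enumerate(benchmarks)
    }


if __name__ == "__main__":
    for name, winners in best_per_benchmark(SCORES, BENCHMARKS).items():
        print(f"{name}: {', '.join(winners)}")
    print(deltas(SCORES, BENCHMARKS, "RePO (Ours)", "GRPO"))
```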

This review was generated by AI and reviewed by human editors.