QUICK REVIEW

[Paper Review] On-Policy Context Distillation for Language Models

Tianzhu Ye, Li Dong|arXiv (Cornell University)|Feb 12, 2026


One-Sentence Summary

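TL;DR: OPCD trains a student on its own on-policy samples while minimizing reverse KL against a context-conditioned teacher, so that experiential knowledge and system prompts are internalized into the model's parameters. Across math, game, and domain-specific tasks, it outperforms off-policy context distillation.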

ABSTRACT

Context distillation enables language models to internalize in-context knowledge into their parameters. In our work, we propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation by training a student model on its own generated trajectories while minimizing reverse Kullback-Leibler divergence against a context-conditioned teacher. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation, where models extract and consolidate transferable knowledge from their historical solution traces, and system prompt distillation, where models internalize beneficial behaviors encoded in optimized prompts. Across mathematical reasoning, text-based games, and domain-specific tasks, OPCD consistently outperforms baseline methods, achieving higher task accuracy while better preserving out-of-distribution capabilities. We further show that OPCD enables effective cross-size distillation, where smaller student models can internalize experiential knowledge from larger teachers.
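
The objective described in the abstract can be written compactly. The notation below is an assumption made for illustration and is not taken from the paper: x is a task input, c is the context (experiential knowledge or a system prompt), \pi_\theta is the student, which never sees c, and \pi_T is the teacher, which is conditioned on c. The response y is sampled from the student itself, which is what makes the training on-policy.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_{t=1}^{|y|}
      D_{\mathrm{KL}}\!\left(
        \pi_\theta(\cdot \mid x, y_{<t})
        \,\middle\|\,
        \pi_T(\cdot \mid c, x, y_{<t})
      \right)
    \right]
```

The student distribution appears as the first argument of the divergence, which is what makes this the reverse KL rather than the forward KL.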

Motivation and Objectives

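Address the limitations of off-policy context distillation (exposure bias and mode-related issues).
Propose On-Policy Context Distillation (OPCD), in which the student learns by comparing its own trajectories against a context-conditioned teacher.
Demonstrate OPCD for experiential knowledge distillation and system prompt distillation on math, game, and domain-specific tasks.
Show that OPCD supports cross-size distillation, in which smaller models learn from larger teachers, and that it reduces forgetting.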

Proposed Method

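Minimize the reverse KL divergence between the student and the context-aware teacher over on-policy samples.
Compute the token-level D_KL with a top-k token approximation; the reverse KL objective encourages mode-seeking behavior.
The student generates responses without the context, and training aligns it with the teacher distribution conditioned on that context.
Allow flexible teacher configurations (a separate teacher and student, a frozen teacher, or self-distillation with shared weights).
Evaluate on experiential knowledge and system prompt distillation tasks, including math problems, text-based games, and medical and safety prompts, and compare against off-policy context distillation baselines.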
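
The steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy models, and the choice to renormalize both distributions over the student's top-k tokens are all hypothetical, since the summary does not specify how the top-k approximation is defined. The sketch only computes the loss value; a real training run would backpropagate through the student's log-probabilities.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def topk_reverse_kl(student_logits, teacher_logits, k):
    """Reverse KL, D_KL(student || teacher), over the student's top-k tokens.

    Assumption: both distributions are renormalized on that top-k support.
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    top = sorted(range(len(p_s)), key=lambda i: p_s[i], reverse=True)[:k]
    z_s = sum(p_s[i] for i in top)
    z_t = sum(p_t[i] for i in top)
    return sum(
        (p_s[i] / z_s) * (math.log(p_s[i] / z_s) - math.log(p_t[i] / z_t))
        for i in top
    )


def opcd_sequence_loss(student_fn, teacher_fn, prompt, context, max_len, k, rng):
    """Sample a response from the student and average the per-token reverse KL.

    student_fn(prompt, prefix) sees no context;
    teacher_fn(context, prompt, prefix) is conditioned on the context.
    """
    prefix, per_token = [], []
    for _ in range(max_len):
        s_logits = student_fn(prompt, prefix)
        t_logits = teacher_fn(context, prompt, prefix)
        per_token.append(topk_reverse_kl(s_logits, t_logits, k))
        # On-policy: the next token comes from the student, not the teacher.
        probs = softmax(s_logits)
        prefix.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return sum(per_token) / len(per_token), prefix


VOCAB = 5  # toy vocabulary size (hypothetical)


def toy_student(prompt, prefix):
    """Hypothetical stand-in for a student language model."""
    return [0.1 * ((i + len(prefix)) % VOCAB) for i in range(VOCAB)]


def toy_teacher(context, prompt, prefix):
    """Hypothetical teacher: same model, but the context boosts one token."""
    logits = toy_student(prompt, prefix)
    logits[context[0] % VOCAB] += 2.0
    return logits


if __name__ == "__main__":
    loss, tokens = opcd_sequence_loss(
        toy_student, toy_teacher, prompt=[1, 2], context=[3],
        max_len=4, k=3, rng=random.Random(0),
    )
    print("mean top-k reverse KL:", loss)
    print("sampled tokens:", tokens)
```

In this toy setup the teacher and student share the same underlying model and differ only in whether the context is visible, which loosely corresponds to the self-distillation configuration; swapping `toy_teacher` for a different function would correspond to a separate teacher.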

Experimental Results

Research Questions

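RQ1: Can on-policy context distillation internalize transient in-context knowledge into model parameters?
RQ2: Does OPCD improve experiential knowledge consolidation and system prompt distillation across domains?
RQ3: Can smaller student models benefit, through OPCD, from larger and possibly frozen teachers?
RQ4: Compared with off-policy methods, does OPCD reduce forgetting on out-of-distribution tasks?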

Key Findings

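OPCD achieves higher test accuracy than off-policy context distillation on math problems and text-based games.
OPCD achieves better out-of-distribution (OOD) performance while maintaining in-distribution accuracy.
For system prompt distillation, OPCD reaches higher accuracy than the off-policy baseline on medical and safety tasks.
OPCD enables effective cross-size distillation: smaller models benefit from larger frozen teachers.
Compared with off-policy methods, on-policy training yields more stable improvements and less forgetting on out-of-distribution data.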


This review was generated by AI and reviewed by human editors.