QUICK REVIEW

[Paper Review] On-Policy Context Distillation for Language Models

Tianzhu Ye, Li Dong|arXiv (Cornell University)|Feb 12, 2026

Topic Modeling | Citations: 0

One-Line Summary

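OPCD trains a student to imitate a context-conditioned teacher using on-policy sampling and reverse KL. This lets the student internalize experiential knowledge and system prompts across math, game, and domain-specific tasks, and it outperforms off-policy context distillation.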

ABSTRACT

Context distillation enables language models to internalize in-context knowledge into their parameters. In our work, we propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation by training a student model on its own generated trajectories while minimizing reverse Kullback-Leibler divergence against a context-conditioned teacher. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation, where models extract and consolidate transferable knowledge from their historical solution traces, and system prompt distillation, where models internalize beneficial behaviors encoded in optimized prompts. Across mathematical reasoning, text-based games, and domain-specific tasks, OPCD consistently outperforms baseline methods, achieving higher task accuracy while better preserving out-of-distribution capabilities. We further show that OPCD enables effective cross-size distillation, where smaller student models can internalize experiential knowledge from larger teachers.

Motivation and Objectives

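Address the limitations of off-policy context distillation (exposure bias and mode-related issues).
Propose On-Policy Context Distillation (OPCD), in which the student learns from its own trajectories against a context-conditioned teacher.
Demonstrate OPCD on experiential knowledge distillation and system prompt distillation across math, game, and domain-specific tasks.
Show that OPCD supports cross-size distillation, in which a smaller model learns from a larger teacher, and that it reduces forgetting.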

Proposed Method

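Minimize the reverse KL divergence between the student and a context-aware teacher on on-policy samples.
Compute the token-level reverse KL with a top-k token approximation, which encourages mode-seeking behavior.
Train by having the student generate responses without the context, then aligning them with the teacher's context-conditioned distribution.
Allow flexible teacher configurations: teacher-student with a frozen teacher, or self-distillation with shared weights.
Evaluate on experiential knowledge and system prompt distillation tasks, including math problems, text-based games, and medical and safety prompts, and compare against off-policy context distillation baselines.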
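
The token-level objective described above can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation. The function names (softmax, topk_reverse_kl, sequence_loss) are hypothetical, and the choice to take the top-k set from the student's distribution and renormalize both distributions over it is an assumption of this sketch; the review only states that a top-k token approximation is used. In a real setup the student logits would come from a forward pass without the context and the teacher logits from a forward pass with the context prepended, both evaluated on a response sampled from the student.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def topk_reverse_kl(student_logits, teacher_logits, k):
    """Approximate KL(student || teacher) at a single token position.

    Assumption of this sketch: keep the k tokens the student rates most
    likely and renormalize both distributions over that subset.
    """
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    # Indices of the student's k most probable tokens.
    top = sorted(range(len(p_student)),
                 key=lambda i: p_student[i], reverse=True)[:k]
    s_mass = sum(p_student[i] for i in top)
    t_mass = max(sum(p_teacher[i] for i in top), 1e-12)
    kl = 0.0
    for i in top:
        s = p_student[i] / s_mass
        t = max(p_teacher[i] / t_mass, 1e-12)  # guard against log of zero
        if s > 0.0:
            kl += s * math.log(s / t)
    return kl


def sequence_loss(student_steps, teacher_steps, k):
    """Average the per-token reverse KL over one student-sampled response."""
    per_token = [topk_reverse_kl(s, t, k)
                 for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


# Toy example: a 3-token vocabulary and a 2-token response.
student = [[2.0, 1.0, 0.0], [2.0, 1.0, 0.0]]
teacher = [[2.0, 1.0, 0.0], [0.0, 1.0, 2.0]]
loss = sequence_loss(student, teacher, k=2)
```

Because the expectation is taken under the student's own distribution, the loss penalizes the student for placing probability on tokens the teacher considers unlikely, which is the mode-seeking behavior noted in the list above.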

Experimental Results

Research Questions

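RQ1: Can on-policy context distillation internalize transient in-context knowledge into model parameters?
RQ2: Does OPCD improve experiential knowledge consolidation and system prompt distillation across domains?
RQ3: Can a smaller student model benefit from a larger frozen teacher via OPCD?
RQ4: Does OPCD reduce forgetting on out-of-distribution (OOD) tasks compared with off-policy methods?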

Key Findings

OPCD outperforms off-policy context distillation in test accuracy on math problems and text-based games.
OPCD better preserves OOD performance than the baselines while achieving higher in-distribution task accuracy.
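System prompt distillation with OPCD yields higher accuracy on medical and safety tasks than the off-policy baseline.
OPCD enables effective cross-size distillation, with smaller models benefiting from a larger frozen teacher.
On-policy training gives more stable improvements and reduces forgetting on OOD data compared with off-policy methods.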

This review was generated by AI and checked by a human editor.