QUICK REVIEW

[Paper Review] Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models

Hui Wu, Hengyi Cai|arXiv (Cornell University)|Feb 1, 2026

Explainable Artificial Intelligence (XAI)|Cited by: 0

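ONE-LINE SUMMARY

SAGE dynamically selects and scores preference pairs during training to maximize gradient efficiency and stability, and it outperforms static DPO baselines on long chain-of-thought reasoning tasks.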

ABSTRACT

Preference-based alignment is pivotal for training large reasoning models; however, standard methods like Direct Preference Optimization (DPO) typically treat all preference pairs uniformly, overlooking the evolving utility of training instances. This static approach often leads to inefficient or unstable optimization, as it wastes computation on trivial pairs with negligible gradients and suffers from noise induced by samples near uncertain decision boundaries. Facing these challenges, we propose SAGE (Stability-Aware Gradient Efficiency), a dynamic framework designed to enhance alignment reliability by maximizing the Signal-to-Noise Ratio of policy updates. Concretely, SAGE integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence with a fine-grained, stability-aware scoring function that prioritizes informative, confident errors while filtering out unstable samples. Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines, highlighting the critical role of policy-aware, stability-conscious data selection in reasoning alignment.

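Motivation and objectives

Motivate the need for dynamic, policy-aware data selection in preference alignment of long-chain reasoning models.
Introduce SAGE, a two-stage framework that combines coarse-grained pool refresh with fine-grained, stability-aware scoring.
Demonstrate that stability-aware, gradient-efficient data selection improves convergence and performance on mathematical reasoning benchmarks.
Show that SAGE improves data efficiency and optimization stability across model scales.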

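Proposed method

Introduce a coarse-grained refreshable pool strategy that updates the on-policy candidate pool over time.
Develop a fine-grained SAGE score that balances gradient signal against prediction confidence using a Newton-inspired, curvature-aware surrogate.
Define the SAGE objective as score-based hard filtering, so that training uses only a high-SNR subset.
Use a difficulty schedule to bias pool composition (easy, medium, hard), and control for response length with a length-normalized score.
Provide ablations showing the contributions of pool construction, informativeness, gradient signal, and curvature regularization.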
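The scoring-and-filtering steps listed above can be illustrated with a small sketch. It is not the paper's formula: this review does not state the SAGE score, so the code assumes a hypothetical score `g**2 / (h + lam)`, where `g` and `h` are the first and second derivatives of the DPO logistic loss with respect to the implicit reward margin and `lam` is an assumed damping constant. The function names, `lam`, and the example margins are illustrative choices, and length normalization is omitted.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def newton_style_score(margin: float, lam: float = 0.1) -> float:
    """Hypothetical curvature-aware utility of one preference pair.

    margin: beta * (chosen log-ratio - rejected log-ratio) under the policy.
    For the DPO loss -log(sigmoid(margin)):
      g = sigmoid(-margin)   gradient magnitude w.r.t. the margin
      h = g * (1 - g)        curvature, largest at the decision boundary
    The damped ratio g^2 / (h + lam) is an assumption made for this sketch.
    """
    g = sigmoid(-margin)
    h = g * (1.0 - g)
    return g * g / (h + lam)


def hard_filter(margins: list[float], gamma: float = 0.5, lam: float = 0.1) -> list[int]:
    """Return the indices of the top-gamma fraction of pairs by score."""
    k = max(1, round(gamma * len(margins)))
    ranked = sorted(
        range(len(margins)),
        key=lambda i: newton_style_score(margins[i], lam),
        reverse=True,
    )
    return sorted(ranked[:k])


# Margins for: a confident error, a boundary sample, a mild error,
# and a pair the policy already ranks correctly with a wide margin.
margins = [-4.0, 0.0, -1.0, 4.0]
kept = hard_filter(margins, gamma=0.5)
print(kept)
```

Under these assumptions the sketch keeps the two pairs the policy currently ranks incorrectly (indices 0 and 2) and drops both the boundary sample, where curvature is highest, and the trivially correct pair, whose gradient is close to zero.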

Figure 1: Schematic illustration of dynamic sample utility in the stability-informativeness space. SAGE prioritizes samples with high signal quality and stable optimization behavior.

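Experimental results

Research questions

RQ1: Does dynamic, policy-adaptive data selection improve stability and efficiency when aligning reasoning models?
RQ2: Does a stability-aware utility score improve gradient quality and convergence compared with the static DPO setting?
RQ3: How do the coarse-grained curriculum and fine-grained SAGE scoring affect long Chain-of-Thought mathematical reasoning benchmarks?
RQ4: How do model scale and data budget interact with SAGE's data selection strategy?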

Key findings

Method	GSM8K	MATH500	Minerva	Gaokao	Olympiad	CollegeMath	AMC23	AIME24	Avg
Vanilla (Qwen2.5-1.5B-Instruct)	73.70	54.60	16.90	46.20	22.70	38.40	6.70	25.00	35.53
w/ DPO (Full)	74.70	56.20	19.50	47.30	20.00	38.00	10.00	22.50	36.03
w/ DPO (Random)	73.50	56.40	19.10	48.60	19.60	37.90	3.30	25.00	35.43
SAGE (Ours)	74.80	57.20	20.20	50.40	21.50	38.10	10.00	27.50	37.46
Vanilla (Qwen2.5-3B-Instruct)	86.90	65.20	25.70	56.40	27.70	44.50	6.70	47.50	45.08
w/ DPO (Full)	86.40	65.60	27.20	56.90	27.00	44.90	10.00	50.00	46.00
w/ DPO (Random)	87.00	65.20	26.10	56.40	26.50	45.00	0.00	45.00	43.90
SAGE (Ours)	87.50	66.00	28.30	58.23	27.70	45.14	13.30	55.00	47.65
Vanilla (Qwen2.5-7B-Instruct)	92.30	81.60	28.30	69.90	45.30	42.40	23.30	57.50	55.08
w/ DPO (Full)	92.70	82.00	29.40	70.60	46.50	42.70	26.70	62.50	56.64
w/ DPO (Random)	91.30	79.40	26.80	71.40	43.00	42.70	20.00	57.50	54.01
SAGE (Ours)	93.10	82.80	33.10	71.40	45.50	43.10	33.30	70.00	59.04
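As a quick arithmetic check on the table above, the sketch below recomputes the Avg column for the Qwen2.5-7B-Instruct rows as the unweighted mean of the eight benchmark scores and reports SAGE's margin over full-data DPO. The scores are copied from the table; treating Avg as an unweighted mean is an assumption that the recomputed values support.

```python
# Benchmark scores copied from the Qwen2.5-7B-Instruct rows of the table
# (GSM8K, MATH500, Minerva, Gaokao, Olympiad, CollegeMath, AMC23, AIME24).
rows = {
    "Vanilla": [92.30, 81.60, 28.30, 69.90, 45.30, 42.40, 23.30, 57.50],
    "DPO (Full)": [92.70, 82.00, 29.40, 70.60, 46.50, 42.70, 26.70, 62.50],
    "DPO (Random)": [91.30, 79.40, 26.80, 71.40, 43.00, 42.70, 20.00, 57.50],
    "SAGE": [93.10, 82.80, 33.10, 71.40, 45.50, 43.10, 33.30, 70.00],
}


def mean(scores):
    """Unweighted mean of a list of benchmark scores."""
    return sum(scores) / len(scores)


averages = {name: mean(scores) for name, scores in rows.items()}
gain_over_full_dpo = averages["SAGE"] - averages["DPO (Full)"]

for name, avg in averages.items():
    print(f"{name:<13} {avg:.2f}")
print(f"SAGE - DPO (Full): {gain_over_full_dpo:+.2f}")
```

The recomputed means agree with the Avg column up to rounding, and SAGE's margin over full-data DPO at 7B comes to about 2.4 points.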

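SAGE achieves the highest average score over the eight mathematical reasoning benchmarks at all three Qwen2.5 scales (1.5B, 3B, 7B), ahead of both full-data and random-subset DPO. On individual benchmarks it is usually but not always best (for example, Olympiad at 7B: 45.50 versus 46.50 for full-data DPO).
Curvature regularization and the gradient-signal component are key factors, yielding gains especially on the harder benchmarks.
SAGE reduces gradient variance relative to DPO, giving a more stable optimization trajectory while improving final accuracy.
Moderate keep ratios (γ ∈ [0.4, 0.6]) balance compute and performance, reaching better accuracy with fewer effective tokens.
SAGE's gains are most pronounced at the intermediate model size (3B) and on harder tasks, indicating better use of preference supervision.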
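The gradient-variance finding above can be made concrete with a small measurement sketch. The review does not say how the paper computes gradient variance or the Signal-to-Noise Ratio, so the definition used here (squared norm of the mean gradient divided by the mean squared deviation of per-sample gradients) is an assumption, and the toy two-dimensional gradients are made up for illustration.

```python
def gradient_snr(per_sample_grads):
    """Assumed SNR: ||mean gradient||^2 / mean squared deviation from the mean."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    mean = [sum(g[d] for g in per_sample_grads) / n for d in range(dim)]
    signal = sum(m * m for m in mean)
    noise = sum(
        sum((g[d] - mean[d]) ** 2 for d in range(dim))
        for g in per_sample_grads
    ) / n
    return signal / noise if noise > 0 else float("inf")


# Toy batches: gradients that agree in direction vs. gradients that conflict.
aligned = [[1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
conflicting = [[1.0, 0.9], [-0.9, 1.1], [1.1, -1.0]]

print(gradient_snr(aligned) > gradient_snr(conflicting))
```

A batch whose per-sample gradients point the same way has a much higher ratio than one whose gradients partly cancel, which is the sense in which a selected subset can give lower-variance updates.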


This review was generated by AI and checked by a human editor.