QUICK REVIEW

[Paper Review] ATTNPO: Attention-Guided Process Supervision for Efficient Reasoning

Shuaiyi Nie, Siyu Ding | arXiv (Cornell University) | Feb 10, 2026

Explainable Artificial Intelligence (XAI) | Citations: 0

One-Line Summary

AttnPO uses the model's intrinsic attention to provide step-level supervision, shortening reasoning while improving accuracy on multiple benchmarks with minimal overhead.

ABSTRACT

Large reasoning models trained with reinforcement learning and verifiable rewards (RLVR) achieve strong performance on complex reasoning tasks, yet often overthink, generating redundant reasoning without performance gains. Existing trajectory-level length penalties often fail to effectively shorten reasoning length and degrade accuracy, as they uniformly treat all reasoning steps and lack fine-grained signals to distinguish redundancy from necessity. Meanwhile, process-supervised methods are typically resource-intensive and suffer from inaccurate credit assignment. To address these issues, we propose ATTNPO, a low-overhead process-supervised RL framework that leverages the model's intrinsic attention signals for step-level credit assignment. We first identify a set of special attention heads that naturally focus on essential steps while suppressing redundant ones. By leveraging the attention scores of these heads, we then employ two sub-strategies to mitigate overthinking by discouraging redundant steps while preserving accuracy by reducing penalties on essential steps. Experimental results show that ATTNPO substantially reduces reasoning length while significantly improving performance across 9 benchmarks.

Motivation and Objectives

Identify whether a subset of attention heads (Key-Focus Heads) naturally focus on essential reasoning steps.
Develop a low-overhead RL framework that uses attention signals for step-level credit assignment.
Mitigate overthinking by attenuating redundant steps while preserving essential reasoning.
Demonstrate efficiency gains (shorter reasoning) without sacrificing accuracy across diverse benchmarks.

Proposed Method

Identify Key-Focus Heads (KFHs) via probing attention distributions over essential vs. redundant steps.
Define stepwise advantage scaling A_hat for correct responses using KFH attention scores (Eq. 4).
Introduce two strategies: Pos-Adv Attenuation (PA) to reduce credit for redundant steps when A^i > 0, and Neg-Adv Attenuation (NA) to soften penalties on essential steps when A^i < 0.
Compute a difficulty-aware baseline S_base^i using problem difficulty and response characteristics (Eq. 7).
Apply a scheduling mechanism (Eq. 8) to modulate attenuation strength based on step redundancy and training progress.
Evaluate AttnPO on math, coding, and science tasks, comparing against outcome-supervised and process-supervised baselines.
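
The PA and NA strategies listed above can be illustrated with a small sketch. This review does not reproduce Eqs. 4, 7, or 8, so everything below is an assumption made for illustration: the function name `attenuate_step_advantages`, the linear weighting, the `baseline` argument (standing in for S_base^i), and the `strength` argument (standing in for whatever the scheduling mechanism would produce). It is not the paper's formula.

```python
from typing import List


def attenuate_step_advantages(
    advantage: float,
    step_scores: List[float],
    baseline: float,
    strength: float = 0.5,
) -> List[float]:
    """Spread one trajectory-level advantage over steps (illustrative only).

    step_scores: hypothetical per-step KFH attention scores in [0, 1];
                 a higher score means the step looks more essential.
    baseline:    hypothetical threshold in (0, 1), standing in for S_base.
    strength:    hypothetical attenuation strength in [0, 1].
    """
    if not 0.0 < baseline < 1.0:
        raise ValueError("baseline must be strictly between 0 and 1")
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")

    scaled = []
    for score in step_scores:
        weight = 1.0
        if advantage > 0 and score < baseline:
            # PA: a redundant-looking step in a rewarded response gets less credit.
            weight = 1.0 - strength * (baseline - score) / baseline
        elif advantage < 0 and score > baseline:
            # NA: an essential-looking step in a penalized response
            # gets a softer penalty.
            weight = 1.0 - strength * (score - baseline) / (1.0 - baseline)
        scaled.append(advantage * weight)
    return scaled


if __name__ == "__main__":
    # Rewarded response: the low-score step keeps only part of the credit.
    print(attenuate_step_advantages(1.0, [0.9, 0.1], baseline=0.5))
    # Penalized response: the high-score step is penalized less.
    print(attenuate_step_advantages(-1.0, [0.9, 0.1], baseline=0.5))
```

In this sketch the weights never exceed 1, so a step's advantage is only ever shrunk toward zero. That mirrors the "attenuation" wording of the two strategies, but the exact functional form used in the paper may differ.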

Experimental Results

Research Questions

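RQ1: Can intrinsic attention signals (KFHs) be used for fine-grained step-level supervision without requiring additional resources?
RQ2: Does stepwise advantage rescaling shorten reasoning while maintaining or improving accuracy?
RQ3: How are redundant and essential steps distributed across layers and heads, and how robust is KFH behavior under RL with a reasoning-length penalty?
RQ4: How does AttnPO affect exploration and generalization to out-of-domain tasks?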

Key Findings

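Some attention heads (KFHs) consistently focus on essential steps while suppressing redundant ones. Step ranking accuracy (SRA) is high, at roughly 0.95–0.96 on the evaluated models.
AttnPO substantially reduces reasoning length (e.g., 61% at 1.5B and 55% at 7B) while improving accuracy (e.g., +7.3 points across six math benchmarks at 1.5B).
On AIME2024, the 1.5B model cuts length by 54% while gaining +9.6 accuracy points; the 7B model cuts length by 55% while gaining +2.9 points.
AttnPO's length reduction is robust across model scales, and it maintains or improves out-of-domain performance (LiveCodeBench, GPQA, MMLU).
PA alone can substantially reduce reasoning length; adding NA eases excessive penalties on essential steps and further improves accuracy.
A small set of roughly the top 3 KFHs is sufficient; adding more yields diminishing returns.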
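
To make the head-selection findings above more concrete, here is a minimal sketch of how heads might be ranked and a small top-k set kept. This review does not define SRA, so the pairwise definition below (the fraction of essential/redundant step pairs in which the essential step receives the higher score) is an assumption, as are the function names, the `(layer, head)` keys, and the toy scores.

```python
from itertools import product
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # hypothetical (layer index, head index) key


def pairwise_ranking_accuracy(scores: List[float], essential: List[bool]) -> float:
    """Fraction of (essential, redundant) step pairs ranked correctly.

    Assumed stand-in for SRA; the paper's exact definition may differ.
    """
    pos = [s for s, e in zip(scores, essential) if e]
    neg = [s for s, e in zip(scores, essential) if not e]
    if not pos or not neg:
        raise ValueError("need at least one essential and one redundant step")
    wins = sum(1 for p, n in product(pos, neg) if p > n)
    return wins / (len(pos) * len(neg))


def top_k_heads(
    head_scores: Dict[Head, List[float]],
    essential: List[bool],
    k: int = 3,
) -> List[Head]:
    """Return the k heads whose step scores best separate essential steps."""
    ranked = sorted(
        head_scores,
        key=lambda h: pairwise_ranking_accuracy(head_scores[h], essential),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Toy labels and per-step attention scores, invented for illustration.
    labels = [True, False, True, False]
    toy = {
        (0, 1): [0.9, 0.1, 0.8, 0.2],
        (1, 0): [0.9, 0.1, 0.15, 0.2],
        (2, 3): [0.1, 0.9, 0.2, 0.8],
    }
    print(top_k_heads(toy, labels, k=2))
```

A probe like this needs step-level essential/redundant labels for a small set of responses. How those labels are obtained is not covered in this review.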

This review was generated by AI and checked by a human editor.