QUICK REVIEW

[Paper Review] Self-Verification Dilemma: Experience-Driven Suppression of Overused Checking in LLM Reasoning

Quanyu Long, Kai Jie Jiang|arXiv (Cornell University)|Feb 3, 2026

Explainable Artificial Intelligence (XAI)|Citations: 0

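One-Sentence Summary

The paper shows that most self-verification (recheck) steps in LLM reasoning are merely confirmatory, and proposes an experience-driven test-time framework that selectively suppresses redundant rechecks, cutting token usage while maintaining or improving accuracy.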

ABSTRACT

Large Reasoning Models (LRMs) achieve strong performance by generating long reasoning traces with reflection. Through a large-scale empirical analysis, we find that a substantial fraction of reflective steps consists of self-verification (recheck) steps that repeatedly confirm intermediate results. These rechecks occur frequently across models and benchmarks, yet the vast majority are confirmatory rather than corrective, rarely identifying errors and altering reasoning outcomes. This reveals a mismatch between how often self-verification is activated and how often it is actually useful. Motivated by this, we propose a novel, experience-driven test-time framework that reduces the overused verification. Our method detects the activation of recheck behavior, consults an offline experience pool of past verification outcomes, and estimates whether a recheck is likely unnecessary via efficient retrieval. When historical experience suggests that a recheck is unnecessary, a suppression signal redirects the model to proceed. Across multiple models and benchmarks, our approach reduces token usage by up to 20.3% while maintaining accuracy, and on some datasets even yields accuracy improvements.

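Motivation and Objectives

Quantify how often LLMs perform reflective self-verification during reasoning.
Distinguish rethink from recheck to understand the functional role of reflection.
Measure how often rechecks are corrective versus confirmatory, and how this affects accuracy.
Propose an offline, experience-driven test-time framework that suppresses low-utility rechecks without retraining the model.
Demonstrate the method's efficiency gains and its accuracy trade-off across multiple models and math benchmarks.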

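Proposed Method

Empirically analyze the reflective steps in reasoning traces and classify each as rethink or recheck.
Annotate each recheck outcome as corrective or confirmatory, using GPT-5 with human verification.
Build an offline experience pool that records the context of past rechecks and whether each was necessary.
Develop a lightweight recheck-activation detector (a binary classifier with accuracy above 97%).
Retrieve the top-k most similar experience units with BM25 to estimate whether the current recheck is useful.
Inject a suppression signal when past experience suggests the recheck is likely unnecessary; model parameters are left unchanged.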
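
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the keyword regex stands in for the paper's learned binary detector, and the `Experience` record, the majority-vote rule, the `threshold` value, and the wording of `SUPPRESSION_SIGNAL` are all hypothetical. BM25 is written out by hand so the block has no dependencies.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Experience:
    context: str       # text around a past recheck (hypothetical schema)
    was_needed: bool   # True if that recheck turned out to be corrective


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


# Stand-in for the learned recheck-activation detector: a keyword heuristic.
RECHECK_CUES = re.compile(
    r"\b(let me (double[- ])?check|verify|double-check|recheck)\b", re.I
)


def is_recheck(step):
    return bool(RECHECK_CUES.search(step))


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of a tokenized query against each tokenized doc.
    Each distinct query term is counted once."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in set(query):
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores


def should_suppress(step, pool, k=3, threshold=0.5):
    """Suppress only if the step looks like a recheck and most of the
    top-k retrieved experiences were unnecessary (confirmatory)."""
    if not is_recheck(step) or not pool:
        return False
    docs = [tokenize(e.context) for e in pool]
    scores = bm25_scores(tokenize(step), docs)
    top = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)[:k]
    unnecessary = sum(not pool[i].was_needed for i in top) / len(top)
    return unnecessary > threshold


# Hypothetical text appended to the trace; model weights are never touched.
SUPPRESSION_SIGNAL = "This result is already verified; continue to the next step."

pool = [
    Experience("let me verify the arithmetic 12 * 12 = 144", False),
    Experience("let me double-check the sum of the series", False),
    Experience("let me verify the sign of the discriminant in the quadratic", True),
    Experience("check the boundary case n = 0 for the induction", True),
]

step = "Wait, let me verify the arithmetic: 15 * 15 = 225"
if should_suppress(step, pool):
    print(SUPPRESSION_SIGNAL)
```

The decision is driven entirely by retrieval over the pool, so changing which rechecks get suppressed means editing the pool or the threshold rather than retraining anything, which is the property the review highlights.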

Figure 1 : Reflective behaviors commonly observed in step-by-step mathematical reasoning. We illustrate three categories: rethink, where the model revises its strategy and explores an alternative line of reasoning; and recheck, where the model verifies already-derived intermediate results through re

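Experimental Results

Research Questions

RQ1: Across benchmarks and models, how often do LLMs exhibit reflective self-verification during reasoning?
RQ2: What fraction of rechecks is corrective versus confirmatory, and how does that affect their utility?
RQ3: Can past verification experience be used to selectively suppress redundant rechecks at test time, without retraining?
RQ4: What accuracy-efficiency trade-off results when experience-driven suppression (EDS) is applied to a range of math benchmarks?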

Key Findings

Model	Dataset	Base Accuracy (%)	FullSuppress Accuracy (%)	EDS Accuracy (%)	Base Length	FullSuppress Length	EDS Length
Qwen3-8B	AIME24	74.58	70.63 (-3.95)	72.92 (-1.66)	14605	12734 (-12.8%)	13296 (-9.0%)
Qwen3-8B	AIME25	67.71	66.67 (-1.04)	70.00 (+2.29)	17133	15713 (-8.3%)	16086 (-6.1%)
Qwen3-8B	AMC	95.62	96.25 (+0.63)	98.75 (+3.13)	8091	6564 (-18.9%)	6893 (-14.8%)
Qwen3-8B	Math500	95.80	95.20 (-0.60)	97.20 (+1.40)	4939	3935 (-20.3%)	4110 (-16.8%)
Qwen3-8B	Olympiad Bench	80.42	79.53 (-0.89)	79.82 (-0.60)	10480	9540 (-9.0%)	9739 (-7.1%)
QWQ-32B	AIME24	79.17	78.75 (-0.42)	83.33 (+4.16)	11237	10105 (-13.4%)	10478 (-9.5%)
QWQ-32B	AIME25	68.54	64.16 (-4.38)	65.63 (-2.91)	15811	14133 (-10.6%)	14908 (-5.7%)
QWQ-32B	AMC	97.50	93.75 (-3.75)	95.00 (-2.50)	7542	6526 (-13.5%)	6719 (-10.9%)
QWQ-32B	Math500	97.00	95.60 (-1.40)	97.00 (-0.00)	4659	3768 (-19.1%)	3940 (-15.4%)
QWQ-32B	Olympiad Bench	81.90	81.45 (-0.45)	83.53 (+1.63)	9602	8454 (-12.0%)	8710 (-9.3%)
DeepSeek-7B	AIME24	57.50	56.67 (-0.83)	58.75 (+1.25)	11237	10105 (-10.1%)	10478 (-6.8%)
DeepSeek-7B	AIME25	39.38	35.42 (-3.96)	36.46 (-2.92)	12489	11221 (-10.1%)	11680 (-7.4%)
DeepSeek-7B	AMC	91.25	90.00 (-1.25)	90.63 (-0.62)	5401	5067 (-6.2%)	5145 (-4.7%)
DeepSeek-7B	Math500	90.60	87.20 (-3.40)	89.80 (-0.80)	3303	2726 (-17.5%)	2891 (-12.5%)
DeepSeek-7B	Olympiad Bench	69.00	66.91 (-2.09)	67.95 (-1.05)	7913	7002 (-11.5%)	7183 (-9.2%)
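
The parenthesized values in the table read as changes relative to the Base column: percentage-point differences for accuracy and relative percentage changes for length. The short sketch below recomputes them for the Qwen3-8B / Math500 row; the helper name `rel_change` is just for illustration.

```python
def rel_change(base, new):
    """Percentage change of `new` relative to `base`."""
    return (new - base) / base * 100


# Qwen3-8B on Math500, values copied from the table above.
base_len, full_len, eds_len = 4939, 3935, 4110
base_acc, eds_acc = 95.80, 97.20

print(f"FullSuppress length: {rel_change(base_len, full_len):+.1f}%")  # -20.3%
print(f"EDS length: {rel_change(base_len, eds_len):+.1f}%")            # -16.8%
print(f"EDS accuracy: {eds_acc - base_acc:+.2f} points")               # +1.40 points
```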

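Reflective steps make up a substantial portion of reasoning, often approaching or exceeding one third of all steps across models and benchmarks.
Rechecks account for a large share of reflections (roughly 40–58%) and are more prevalent on easier datasets, where local verification is more common than strategy revision.
About 85–95% of rechecks are confirmatory: they change neither intermediate results nor the final answer.
The offline experience pool makes it possible to estimate whether the current recheck is useful, which enables selective suppression.
EDS shortens reasoning by about 9% on average, with a reduction of up to 20.3% on MATH500, while maintaining or slightly improving accuracy across models and datasets.
Compared with full suppression and aggressive truncation methods, EDS preserves necessary rethinks and useful rechecks, achieving a favorable accuracy-efficiency trade-off.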
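
The first three findings are ratios over annotated reasoning steps. As a rough illustration of how such ratios could be computed, the sketch below uses a made-up label scheme (`forward`, `rethink`, `recheck:confirmatory`, `recheck:corrective`) and a made-up ten-step trace; neither comes from the paper.

```python
from collections import Counter

# Hypothetical annotations, one label per reasoning step.
steps = [
    "forward", "forward", "recheck:confirmatory", "forward", "rethink",
    "recheck:confirmatory", "forward", "recheck:corrective", "forward", "forward",
]


def reflection_stats(labels):
    """Share of reflective steps, share of rechecks among reflections,
    and share of confirmatory rechecks among all rechecks."""
    counts = Counter(labels)
    rechecks = sum(v for k, v in counts.items() if k.startswith("recheck"))
    reflections = rechecks + counts["rethink"]
    return {
        "reflection_share": reflections / len(labels),
        "recheck_share_of_reflections": rechecks / reflections if reflections else 0.0,
        "confirmatory_share_of_rechecks": (
            counts["recheck:confirmatory"] / rechecks if rechecks else 0.0
        ),
    }


stats = reflection_stats(steps)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```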

Figure 2 : Percentage of steps classified as reflections.


This review was generated by AI and checked by a human editor.