QUICK REVIEW

[Paper Review] Self-Verification Dilemma: Experience-Driven Suppression of Overused Checking in LLM Reasoning

Quanyu Long, Kai Jie Jiang|arXiv (Cornell University)|Feb 3, 2026
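Explainable Artificial Intelligence (XAI) | Cited by 0
One-Sentence Summary

The paper shows that most self-verification (recheck) steps in LLM reasoning are confirmatory, and proposes an experience-driven test-time framework that selectively suppresses redundant rechecks, reducing token usage while maintaining or improving accuracy.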

ABSTRACT

Large Reasoning Models (LRMs) achieve strong performance by generating long reasoning traces with reflection. Through a large-scale empirical analysis, we find that a substantial fraction of reflective steps consists of self-verification (recheck) steps that repeatedly confirm intermediate results. These rechecks occur frequently across models and benchmarks, yet the vast majority are confirmatory rather than corrective, rarely identifying errors or altering reasoning outcomes. This reveals a mismatch between how often self-verification is activated and how often it is actually useful. Motivated by this, we propose a novel, experience-driven test-time framework that reduces overused verification. Our method detects the activation of recheck behavior, consults an offline experience pool of past verification outcomes, and estimates via efficient retrieval whether a recheck is likely unnecessary. When historical experience suggests that a recheck is unnecessary, a suppression signal redirects the model to proceed. Across multiple models and benchmarks, our approach reduces token usage by up to 20.3% while maintaining accuracy, and on some datasets it even improves accuracy.

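Research Motivation and Objectives

  • Quantify how often LLMs perform reflective self-verification during reasoning.
  • Distinguish rethink from recheck to understand the functional role of reflection.
  • Measure the proportion of corrective versus confirmatory rechecks and their effect on accuracy.
  • Propose an offline, experience-driven test-time framework that suppresses inefficient rechecks without retraining the model.
  • Demonstrate the method's efficiency gains and accuracy trade-offs across multiple models and math benchmarks.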

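Proposed Method

  • Classify reflective steps as rethink or recheck through an empirical analysis of reasoning traces.
  • Label each recheck outcome as corrective or confirmatory using GPT-5 and human inspection.
  • Build an offline experience pool that records the context and necessity of past rechecks.
  • Develop a lightweight recheck-activation detector (a binary classifier with accuracy above 97%).
  • Retrieve the top-k most similar experience units with BM25 to estimate whether the current recheck is useful.
  • Inject a suppression signal, without changing model parameters, when past experience indicates the recheck is unlikely to help.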
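
The retrieval-and-suppress steps listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Experience` record, the whitespace tokenizer, the small hand-written BM25 scorer, the vote threshold, and the wording of `SUPPRESS_HINT` are all assumptions made for the example, and the recheck-activation detector is assumed to have already fired. The pool is assumed to be non-empty.

```python
import math
from collections import Counter
from dataclasses import dataclass

@dataclass
class Experience:
    context: str         # reasoning text that preceded a past recheck
    was_necessary: bool  # True if that recheck turned out to be corrective

def tokenize(text):
    # Assumption: plain lowercase whitespace tokenization.
    return text.lower().split()

class BM25:
    """Small self-contained BM25 scorer over a non-empty list of documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.tfs = [Counter(d) for d in self.docs]
        n = len(self.docs)
        df = Counter(term for d in self.docs for term in set(d))
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            norm = tf[t] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * tf[t] * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k):
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        # Keep only documents that share at least one term with the query.
        return [i for s, i in ranked[:k] if s > 0]

# Hypothetical wording; the source does not give the actual suppression signal.
SUPPRESS_HINT = "This result is already verified; continue to the next step."

def should_suppress(context, pool, index, k=3, threshold=0.5):
    """Suppress when fewer than `threshold` of the retrieved rechecks were necessary."""
    hits = index.top_k(context, k)
    if not hits:
        return False  # no similar experience: let the model recheck
    necessary_rate = sum(pool[i].was_necessary for i in hits) / len(hits)
    return necessary_rate < threshold

def next_prompt(context, pool, index):
    # Append the hint as extra text only; model parameters are never touched.
    if should_suppress(context, pool, index):
        return context + " " + SUPPRESS_HINT
    return context
```

Defaulting to "do not suppress" when nothing similar is retrieved is a conservative choice made for this sketch, so that rechecks with no historical evidence are left alone.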
Figure 1 : Reflective behaviors commonly observed in step-by-step mathematical reasoning. We illustrate three categories: rethink, where the model revises its strategy and explores an alternative line of reasoning; and recheck, where the model verifies already-derived intermediate results through re

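Experimental Results

Research Questions

  • RQ1: How often does reflective self-verification occur in LLM reasoning across benchmarks and models?
  • RQ2: What proportion of rechecks are corrective versus confirmatory, and what does this imply for their usefulness?
  • RQ3: Can past verification experience be used to selectively suppress redundant rechecks at test time without retraining?
  • RQ4: What accuracy–efficiency trade-off results from applying experience-driven suppression (EDS) to diverse math benchmarks?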
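
As an illustration of the quantities behind RQ1 and RQ2, the sketch below tallies three ratios from a labeled reasoning trace. The label names (`"normal"`, `"rethink"`, `"recheck"`, `"confirmatory"`, `"corrective"`) and the `(kind, outcome)` tuple layout are assumptions chosen for the example, not the paper's annotation schema.

```python
from collections import Counter

def reflection_stats(steps):
    """Compute reflection ratios from a list of (kind, outcome) pairs.

    kind is "normal", "rethink", or "recheck"; outcome is "confirmatory"
    or "corrective" for rechecks and None for every other step.
    """
    kinds = Counter(kind for kind, _ in steps)
    reflections = kinds["rethink"] + kinds["recheck"]
    confirmatory = sum(
        1 for kind, outcome in steps
        if kind == "recheck" and outcome == "confirmatory"
    )

    def ratio(part, whole):
        # Guard against empty denominators in short traces.
        return part / whole if whole else 0.0

    return {
        "reflection_share": ratio(reflections, len(steps)),
        "recheck_share_of_reflections": ratio(kinds["recheck"], reflections),
        "confirmatory_share_of_rechecks": ratio(confirmatory, kinds["recheck"]),
    }
```

The first ratio corresponds to RQ1 (how much of the trace is reflection), and the last to RQ2 (how many rechecks merely confirm an earlier result).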

Main Findings

| Model | Dataset | Accuracy Base (%) | Accuracy FullSuppress (%) | Accuracy EDS (%) | Length Base | Length FullSuppress | Length EDS |
|---|---|---|---|---|---|---|---|
| Qwen3-8B | AIME24 | 74.58 | 70.63 (-3.95) | 72.92 (-1.66) | 14605 | 12734 (-12.8%) | 13296 (-9.0%) |
| Qwen3-8B | AIME25 | 67.71 | 66.67 (-1.04) | 70.00 (+2.29) | 17133 | 15713 (-8.3%) | 16086 (-6.1%) |
| Qwen3-8B | AMC | 95.62 | 96.25 (+0.63) | 98.75 (+3.13) | 8091 | 6564 (-18.9%) | 6893 (-14.8%) |
| Qwen3-8B | Math500 | 95.80 | 95.20 (-0.60) | 97.20 (+1.40) | 4939 | 3935 (-20.3%) | 4110 (-16.8%) |
| Qwen3-8B | Olympiad Bench | 80.42 | 79.53 (-0.89) | 79.82 (-0.60) | 10480 | 9540 (-9.0%) | 9739 (-7.1%) |
| QWQ-32B | AIME2024 | 79.17 | 78.75 (-0.42) | 83.33 (+4.16) | 11237 | 10105 (-13.4%) | 10478 (-9.5%) |
| QWQ-32B | AIME2025 | 68.54 | 64.16 (-4.38) | 65.63 (-2.91) | 15811 | 14133 (-10.6%) | 14908 (-5.7%) |
| QWQ-32B | AMC | 97.50 | 93.75 (-3.75) | 95.00 (-2.50) | 7542 | 6526 (-13.5%) | 6719 (-10.9%) |
| QWQ-32B | Math500 | 97.00 | 95.60 (-1.40) | 97.00 (-0.00) | 4659 | 3768 (-19.1%) | 3940 (-15.4%) |
| QWQ-32B | Olympiad Bench | 81.90 | 81.45 (-0.45) | 83.53 (+1.63) | 9602 | 8454 (-12.0%) | 8710 (-9.3%) |
| DeepSeek-7B | AIME24 | 57.50 | 56.67 (-0.83) | 58.75 (+1.25) | 11237 | 10105 (-10.1%) | 10478 (-6.8%) |
| DeepSeek-7B | AIME25 | 39.38 | 35.42 (-3.96) | 36.46 (-2.92) | 12489 | 11221 (-10.1%) | 11680 (-7.4%) |
| DeepSeek-7B | AMC | 91.25 | 90.00 (-1.25) | 90.63 (-0.62) | 5401 | 5067 (-6.2%) | 5145 (-4.7%) |
| DeepSeek-7B | Math500 | 90.60 | 87.20 (-3.40) | 89.80 (-0.80) | 3303 | 2726 (-17.5%) | 2891 (-12.5%) |
| DeepSeek-7B | Olympiad Bench | 69.00 | 66.91 (-2.09) | 67.95 (-1.05) | 7913 | 7002 (-11.5%) | 7183 (-9.2%) |

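  • Reflective steps make up a considerable share of reasoning, often approaching or exceeding one third of all steps across models and benchmarks.
  • Rechecks account for a large share of reflections (about 40–58%); on easier datasets they are more likely to take the form of local verification than strategic revision.
  • About 85–95% of rechecks are confirmatory and change neither intermediate results nor the final answer.
  • The offline experience pool makes it possible to estimate whether the current recheck is useful, enabling selective suppression.
  • EDS shortens reasoning by about 9% on average, with a maximum reduction of 20.3% on MATH500, while maintaining or slightly improving accuracy across multiple models and datasets.
  • Compared with approaches such as full suppression and strong suppression, EDS preserves necessary rethinks and beneficial rechecks, yielding a more favorable accuracy–efficiency trade-off.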
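
The parenthesized percentages in the length columns of the table appear to be relative changes against the base length. As a small worked check, the sketch below recomputes them for the Qwen3-8B / Math500 row, with the three lengths copied from the table; the helper name `relative_change_pct` is introduced only for this example.

```python
def relative_change_pct(base, new):
    """Relative change of `new` against `base`, in percent."""
    return (new - base) / base * 100.0

# Qwen3-8B on Math500: base, full-suppress, and EDS lengths from the table.
base_len, full_suppress_len, eds_len = 4939, 3935, 4110

print(f"{relative_change_pct(base_len, full_suppress_len):+.1f}%")  # -20.3%
print(f"{relative_change_pct(base_len, eds_len):+.1f}%")            # -16.8%
```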
Figure 2 : Percentage of steps classified as reflections.

This review was generated by AI and reviewed by human editors.