QUICK REVIEW

[Paper Review] Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

Siddharth Boppana, Annabel Ma|arXiv (Cornell University)|Mar 5, 2026
Embodied and Extended Cognition | Cited by 0
One-Sentence Summary

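The paper presents evidence of performative chain-of-thought, in which a model's activations reveal strong confidence in its final answer well before the CoT text does. It shows that factors such as task difficulty and model scale affect whether the reasoning is performative or genuine, and it proposes attention-probe-guided early exit to improve efficiency.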

ABSTRACT

We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and find task difficulty-specific differences: The model's final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.

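Motivation and Objectives

  • Investigate whether reasoning LLMs commit internally to a final answer early in the chain-of-thought (CoT) sequence.
  • Distinguish performative CoT from genuine step-by-step reasoning across task difficulties and model scales.
  • Develop and evaluate attention-based probes that decode the final answer from activations.
  • Assess whether calibrated early exit can reduce token usage without lowering accuracy.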

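Proposed Method

  • Train attention probes on layer activations to predict the final answer from a reasoning prefix.
  • Use forced-answer prompts at intermediate steps to elicit the model's current final prediction.
  • Use a CoT monitor to detect whether the text of the CoT prefix already signals the final answer.
  • Compare the probe and forced-answer signals with the CoT monitor signal, and track internal belief changes across tasks and models.
  • Evaluate probe calibration and the probe's ability to trigger early exit to save tokens.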
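
The first step above can be illustrated with a minimal sketch of one common attention-probe form: a learned query vector scores each token's activation, the softmax of those scores pools the prefix into a single vector, and a linear head maps that vector to answer-choice probabilities. This summary does not specify the paper's exact probe architecture, so the function name `attention_probe`, the single-query design, and the toy weights are all illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_probe(
    acts: Sequence[Sequence[float]],           # T x d activations for a CoT prefix
    query: Sequence[float],                    # d, hypothetical learned query vector
    class_weights: Sequence[Sequence[float]],  # C x d, hypothetical linear head
) -> List[float]:
    """Return a probability distribution over C answer choices."""
    # Score each token position against the query, then normalize.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in acts]
    attn = softmax(scores)
    # Attention-weighted average of the activations (pooled prefix vector).
    d = len(query)
    pooled = [sum(a * h[j] for a, h in zip(attn, acts)) for j in range(d)]
    # Linear head over the pooled vector, one logit per answer choice.
    logits = [sum(w_j * p_j for w_j, p_j in zip(w, pooled)) for w in class_weights]
    return softmax(logits)


if __name__ == "__main__":
    # Toy example: 3 token positions, hidden size 2, 4 answer choices (A-D).
    toy_acts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    toy_query = [0.5, -0.5]
    toy_head = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [-1.0, -1.0]]
    print(attention_probe(toy_acts, toy_query, toy_head))
```

In practice the query and head would be trained on (prefix activations, final answer) pairs. Unlike a linear probe applied at a single position, the attention pooling lets the probe weight positions across the whole prefix.
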
Figure 1: Early decoding helps us identify performative reasoning, when an LLM knows what it will answer. We study whether a reasoning LLM’s final answer can be decoded given a prefix of its chain of thought up to an intermediate token $x$. We use this to identify performative reasoning, where a

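Experimental Results

Research Questions

  • RQ1: Can attention-based probes decode the model's final answer from a chain-of-thought prefix?
  • RQ2: How does performative CoT vary with task difficulty and model scale across models and benchmarks?
  • RQ3: Do inflection points in reasoning (backtracking, 'aha' moments) stem from genuine belief updates or from performative behavior?
  • RQ4: Can calibrated probes enable safe and efficient early exit without sacrificing accuracy?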

Key Findings

Model / Dataset | Probe vs. Monitor | Forced Answer vs. Monitor
DeepSeek-R1 (MMLU) | 0.417 | 0.505
DeepSeek-R1 (GPQA-D) | 0.012 | 0.010
GPT-OSS (MMLU) | 0.435 | 0.334
GPT-OSS (GPQA-D) | 0.227 | 0.185

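  • Attention probes can decode the final answer from late-layer activations; linear probes fail.
  • Easier tasks (e.g., MMLU) show strongly performative CoT: probes and forced answering predict the outcome before the CoT monitor can. Harder tasks (e.g., GPQA-D) show more genuine reasoning.
  • Inflection points in reasoning (backtracking, 'aha' moments) occur more often when internal confidence shifts substantially, indicating that in many cases they reflect genuine updates rather than performative behavior.
  • Model scale and task difficulty modulate performativity: larger models and harder tasks tend toward more genuine CoT, while smaller models need more test-time compute to reach a final answer.
  • Calibrated attention probes enable effective early exit, saving up to about 80% of tokens on MMLU-Redux and about 30% on GPQA-Diamond while maintaining comparable accuracy.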
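
The early-exit finding can be sketched as a simple thresholding rule: stop generating at the first reasoning step where the probe's top answer probability reaches a confidence threshold, and report the fraction of tokens saved. The helper names `early_exit_step` and `token_savings`, the threshold of 0.9, and the toy trajectory below are assumptions for illustration only. The summary does not state the paper's actual exit criterion, and a threshold is only meaningful if the probe is reasonably well calibrated.

```python
from typing import Sequence, Tuple


def early_exit_step(
    probe_probs: Sequence[Sequence[float]],  # one answer distribution per reasoning step
    threshold: float = 0.9,                  # assumed confidence threshold
) -> Tuple[int, int]:
    """Return (step index, answer index) at the first sufficiently confident step.

    Falls back to the last step if the threshold is never reached.
    """
    for step, probs in enumerate(probe_probs):
        best = max(range(len(probs)), key=lambda k: probs[k])
        if probs[best] >= threshold:
            return step, best
    last = probe_probs[-1]
    return len(probe_probs) - 1, max(range(len(last)), key=lambda k: last[k])


def token_savings(exit_step: int, tokens_per_step: Sequence[int]) -> float:
    """Fraction of CoT tokens skipped by stopping after `exit_step` (inclusive)."""
    used = sum(tokens_per_step[: exit_step + 1])
    total = sum(tokens_per_step)
    return 1.0 - used / total


if __name__ == "__main__":
    # Toy probe trajectory over 4 reasoning steps and 4 answer choices.
    trajectory = [
        [0.25, 0.25, 0.25, 0.25],
        [0.10, 0.60, 0.20, 0.10],
        [0.02, 0.95, 0.02, 0.01],
        [0.00, 1.00, 0.00, 0.00],
    ]
    step, answer = early_exit_step(trajectory, threshold=0.9)
    print(step, answer, token_savings(step, [100, 100, 100, 700]))
```

Raising the threshold trades token savings for a lower risk of exiting on a wrong answer. That trade-off is what "comparable accuracy" refers to in the finding above.
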
Figure 2: Accuracy of three early decoding methods by position of DeepSeek-R1 and GPT-OSS on MMLU-Redux and GPQA-Diamond. MMLU (left): For both models, probing and forced answering predict the models’ predictions with much higher accuracy earlier than CoT Monitoring. The CoT monitor’s accuracy rapi

This review was generated by AI and reviewed by human editors.