QUICK REVIEW

[Paper Review] Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

Siddharth Boppana, Annabel Ma|arXiv (Cornell University)|2026. 03. 05.

Embodied and Extended Cognition | Citations: 0

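One-Line Summary

This paper demonstrates performative chain-of-thought (CoT), in which a model's confidence in its final answer is decodable from its internals well before the generated CoT reveals it. It shows that factors such as task difficulty and model size influence whether reasoning is performative or genuine, and it proposes attention-probe-based early exit for efficiency.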

ABSTRACT

We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and find task difficulty-specific differences: The model's final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.

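Motivation and Objectives

Determine whether the final answer of a reasoning LLM is already decodable from its internals early in the chain-of-thought (CoT) sequence.
Distinguish performative CoT from genuine step-by-step reasoning as a function of task difficulty and model size.
Develop and evaluate attention-based probes that decode the final answer from activations.
Assess whether calibrated early exit can reduce token usage without hurting accuracy.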

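Proposed Method

Train attention probes on layer activations to predict the final answer from a reasoning prefix.
Use forced-answer prompts at intermediate steps to elicit the model's current final prediction.
Use a CoT monitor to detect when the model signals its final answer in the CoT prefix.
Compare probe and forced-answer signals, CoT monitor signals, and shifts in internal belief across tasks and models.
Evaluate how well calibrated the probes are and whether they enable token-saving early exit.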
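
The probing step above can be illustrated with a minimal sketch of an attention-pooling probe. This is not the paper's implementation: the single-head design, the parameter names `query` and `class_weights`, and the toy activations are all assumptions made for illustration. The idea is that a learned query vector scores each token's activation, the scores are softmaxed into attention weights, the activations are pooled with those weights, and a linear readout maps the pooled vector to answer-choice probabilities.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_probe(activations, query, class_weights):
    """Single-head attention-pooling probe (illustrative assumption).

    activations:   list of T token vectors (each of dimension d) from one layer
    query:         d-dim vector scoring each token (hypothetical learned parameter)
    class_weights: one d-dim vector per answer choice (hypothetical learned parameter)
    Returns (answer probabilities, attention weights over tokens).
    """
    # Score every token against the query, then normalize into attention weights.
    attn = softmax([dot(h, query) for h in activations])
    # Attention-weighted average of the token activations.
    d = len(query)
    pooled = [sum(a * h[i] for a, h in zip(attn, activations)) for i in range(d)]
    # Linear readout to one logit per answer choice.
    probs = softmax([dot(w, pooled) for w in class_weights])
    return probs, attn


# Toy usage: three tokens with 2-dim activations, two answer choices.
toy_activations = [[1.0, 0.0], [0.0, 1.0], [0.0, 5.0]]
toy_query = [0.0, 1.0]
toy_class_weights = [[1.0, 0.0], [0.0, 1.0]]
probs, attn = attention_probe(toy_activations, toy_query, toy_class_weights)
print("answer probabilities:", probs)
print("attention weights:", attn)
```

In practice the probe parameters would be trained on activations from reasoning prefixes paired with the model's eventual final answer; the training loop is omitted here.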

Figure 1: Early decoding helps us identify performative reasoning, when an LLM knows what it will answer. We study whether a reasoning LLM’s final answer can be decoded given a prefix of its chain of thought up to an intermediate token $x$. We use this to identify performative reasoning, where a

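Experimental Results

Research Questions

RQ1: Can an attention-based probe decode the model's final answer from a prefix of its chain-of-thought?
RQ2: How does performative CoT vary with task difficulty and model size across models and benchmarks?
RQ3: Do inflection points in the reasoning correspond to genuine belief updates, or are they performative?
RQ4: Can a calibrated probe enable safe, efficient early exit without hurting accuracy?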

Key Results

Model / Dataset	Probe vs. Monitor	Forced vs. Monitor
DeepSeek-R1 (MMLU)	0.417	0.505
DeepSeek-R1 (GPQA-D)	0.012	0.010
GPT-OSS (MMLU)	0.435	0.334
GPT-OSS (GPQA-D)	0.227	0.185

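Attention probes can decode the final answer from later-layer activations, whereas linear probes fail.
Easier tasks (e.g., MMLU) show strongly performative CoT, with probes and forced answering predicting the final answer earlier than the CoT monitor, whereas harder tasks (GPQA-D) show more genuine reasoning.
Inflection points (backtracking, 'aha' moments) occur mainly when internal confidence shifts, suggesting that in most cases they reflect genuine updates rather than performance.
Model size and task difficulty both affect performativity: larger models and harder tasks lean toward more faithful CoT, while smaller models need more test-time compute to reach the final answer.
Calibrated attention probes enable effective early exit, achieving up to 80% token savings on MMLU-Redux and about 30% on GPQA-Diamond at similar accuracy.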
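
The early-exit result can be made concrete with a small sketch of a confidence-threshold stopping rule. This is an illustration under stated assumptions, not the paper's procedure: the function names, the `0.95` threshold, and the toy numbers are hypothetical. The rule stops generation at the first checkpoint where the probe's top answer probability reaches the threshold, and token savings are measured against the full CoT length.

```python
def early_exit_position(confidences, threshold=0.95):
    """Return the index of the first checkpoint whose probe confidence
    reaches the threshold, or the last index if it is never reached.

    confidences: the probe's top answer probability at each checkpoint,
                 in generation order (hypothetical input format)
    threshold:   exit threshold; 0.95 is an illustrative value only
    """
    if not confidences:
        raise ValueError("confidences must not be empty")
    for i, c in enumerate(confidences):
        if c >= threshold:
            return i
    return len(confidences) - 1


def token_savings(checkpoint_tokens, exit_index):
    """Fraction of tokens saved by stopping at exit_index.

    checkpoint_tokens: cumulative token count at each checkpoint;
                       the last entry is the full CoT length
    """
    full_length = checkpoint_tokens[-1]
    return 1.0 - checkpoint_tokens[exit_index] / full_length


# Toy usage: the probe becomes confident at the third of four checkpoints.
toy_confidences = [0.40, 0.70, 0.96, 0.99]
toy_checkpoint_tokens = [100, 200, 300, 400]
exit_at = early_exit_position(toy_confidences)
print("exit checkpoint:", exit_at)
print("token savings:", token_savings(toy_checkpoint_tokens, exit_at))
```

A threshold rule of this kind is only as trustworthy as the probe's calibration: if a stated confidence of 0.95 does not correspond to roughly 95% agreement with the model's eventual answer, exiting early will cost accuracy.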

Figure 2: Accuracy of three early decoding methods by position of DeepSeek-R1 and GPT-OSS on MMLU-Redux and GPQA-Diamond. MMLU (left): For both models, probing and forced answering predict the models’ predictions with much higher accuracy earlier than CoT Monitoring. The CoT monitor’s accuracy rapi

This review was generated by AI and reviewed by a human editor.