QUICK REVIEW

[Paper Review] Me, Myself, and $\pi$: Evaluating and Explaining LLM Introspection

Atharv Naphade, Samarth Bhargav|arXiv (Cornell University)|2026. 03. 17.

Explainable Artificial Intelligence (XAI) | Citations: 0

One-Line Summary

The paper formalizes introspection as latent reasoning over an LLM's policy, introduces Introspect-Bench to test it rigorously, and shows that frontier models possess privileged access to their own policies through an attention-diffusion mechanism.

ABSTRACT

A hallmark of human intelligence is Introspection, the ability to assess and reason about one's own cognitive processes. Introspection has emerged as a promising but contested capability in large language models (LLMs). However, current evaluations often fail to distinguish genuine meta-cognition from the mere application of general world knowledge or text-based self-simulation. In this work, we propose a principled taxonomy that formalizes introspection as the latent computation of specific operators over a model's policy and parameters. To isolate the components of generalized introspection, we present Introspect-Bench, a multifaceted evaluation suite designed for rigorous capability testing. Our results show that frontier models exhibit privileged access to their own policies, outperforming peer models in predicting their own behavior. Furthermore, we provide causal, mechanistic evidence explaining both how LLMs learn to introspect without explicit training, and how the mechanism of introspection emerges via attention diffusion.

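Motivation and Objectives

Formalizes policy introspection as an LLM holding accurate beliefs about its own policy function.
Decomposes introspection into short-term policy, long-term policy, and inverse policy aspects.
Provides Introspect-Bench to separate introspective reasoning from external reasoning.
Empirically evaluates frontier models and analyzes cross-model introspection ability.
Offers a mechanistic explanation of how introspection emerges without explicit training.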

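Proposed Method

Defines f-introspection and (f, θ)-introspection to distinguish policy-level introspection from mechanistic introspection.
Proposes Introspect-Bench, with tasks targeting short-term, long-term, and inverse introspection.
Evaluates a range of frontier models on open-ended tasks to avoid memorization artifacts.
Uses cross-model calibration to demonstrate privileged access to a model's own policy.
Analyzes whether long-term introspection emerges through KL-divergence comparisons (p vs. p′ vs. p*).
Provides a mechanistic explanation via attention diffusion (logit lens and attention-pattern analysis).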
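
The KL-divergence comparison in the method list can be illustrated with a small sketch. This review does not define p, p′, and p*, so the three distributions below (the model's actual behavior, a prediction made under an introspective prompt, and a baseline prediction) are assumed roles with made-up numbers, not the paper's setup. The function itself is the standard discrete KL divergence.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) in nats for two discrete distributions on the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must share a support")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:  # terms with p_i == 0 contribute nothing
            total += p_i * math.log(p_i / max(q_i, eps))
    return total

# Hypothetical toy distributions over four candidate behaviors (illustrative only).
p_actual = [0.70, 0.20, 0.05, 0.05]         # assumed: what the model actually does
p_introspective = [0.60, 0.25, 0.10, 0.05]  # assumed: prediction under an introspective prompt
p_baseline = [0.25, 0.25, 0.25, 0.25]       # assumed: prediction without introspective access

kl_intro = kl_divergence(p_actual, p_introspective)
kl_base = kl_divergence(p_actual, p_baseline)
print(f"KL(actual || introspective) = {kl_intro:.4f}")
print(f"KL(actual || baseline)      = {kl_base:.4f}")
```

Under these assumed roles, a smaller divergence for the introspective prediction than for the baseline would be read as evidence of better access to the model's own policy.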

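Experimental Results

Research Questions

RQ1: Can LLMs form accurate beliefs about their own policies and their components?
RQ2: Do frontier models show privileged access to their own policies compared with peer models?
RQ3: Does introspection arise from explicit training, or does it emerge spontaneously from standard training?
RQ4: What is the mechanistic process behind introspection (e.g., attention diffusion)?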

Key Results

Model	K-th Word	CoT Prediction	Paraphrase	Heads-Up	Average
xAI Grok 4.1 Fast	57.0%	58.63%	60.69%	91.43%	66.94%
Meta Llama 3.3 70B Instruct	60.4%	70.29%	42.19%	93.88%	66.69%
OpenAI GPT-4o	55.8%	62.99%	47.12%	99.18%	66.27%
Qwen Qwen3 235B	56.4%	65.07%	42.43%	96.53%	65.11%
OpenAI GPT-4.1 Mini	58.6%	67.98%	42.2%	91.02%	64.95%
Self Introspection	54.55%	68.69%	39.07%	94.43%	64.19%
Google Gemini 3 Flash Preview	42.6%	64.03%	46.33%	97.55%	62.63%
Google Gemini 2.5 Flash	56.0%	57.32%	39.08%	97.35%	62.44%
OpenAI GPT-4o Mini	50.6%	62.66%	36.44%	96.33%	61.51%
Google Gemini 2.0 Flash 001	47.8%	61.39%	41.47%	95.31%	61.49%
NousResearch Hermes 4 405B	38.2%	54.14%	36.26%	94.49%	55.77%
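
The last column of the table appears to be the unweighted mean of the four task scores. The sketch below recomputes it from the values copied out of the table as a consistency check. The variable names are illustrative, and the equal weighting is an inference from the numbers, not something the review states.

```python
# Per-task scores (percent) copied from the table, in column order:
# K-th word, CoT prediction, paraphrase, heads-up; then the reported average.
rows = {
    "xAI Grok 4.1 Fast": ([57.0, 58.63, 60.69, 91.43], 66.94),
    "Meta Llama 3.3 70B Instruct": ([60.4, 70.29, 42.19, 93.88], 66.69),
    "OpenAI GPT-4o": ([55.8, 62.99, 47.12, 99.18], 66.27),
    "Qwen Qwen3 235B": ([56.4, 65.07, 42.43, 96.53], 65.11),
    "OpenAI GPT-4.1 Mini": ([58.6, 67.98, 42.2, 91.02], 64.95),
    "Self Introspection": ([54.55, 68.69, 39.07, 94.43], 64.19),
    "Google Gemini 3 Flash Preview": ([42.6, 64.03, 46.33, 97.55], 62.63),
    "Google Gemini 2.5 Flash": ([56.0, 57.32, 39.08, 97.35], 62.44),
    "OpenAI GPT-4o Mini": ([50.6, 62.66, 36.44, 96.33], 61.51),
    "Google Gemini 2.0 Flash 001": ([47.8, 61.39, 41.47, 95.31], 61.49),
    "NousResearch Hermes 4 405B": ([38.2, 54.14, 36.26, 94.49], 55.77),
}

def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)

for name, (scores, reported) in rows.items():
    recomputed = mean(scores)
    print(f"{name}: recomputed {recomputed:.4f}, reported {reported:.2f}")
```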

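Frontier models show privileged access to their own policies and outperform peer models on self-prediction tasks.
Introspect-Bench tasks are diverse, and strong performance on one task does not guarantee transfer to another.
For long-term introspection, introspective prompts substantially improve latent access to long-term policy behavior relative to non-introspective prompts.
Evidence supports attention diffusion as the main mechanism and a causal contributor to introspection, with layer 60 critical to the divergence.
Fine-tuning experiments induce self-prediction ability, showing that introspection can emerge without explicit supervision.
Attention diffusion accounts for a meaningful share of the logit shift observed during introspective reasoning.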
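
The logit lens named among the analysis tools is commonly implemented by projecting an intermediate hidden state through the model's unembedding matrix and reading off the implied token distribution at each layer. The sketch below shows that idea on toy numbers. The dimensions, the matrix, and the hidden states are all hypothetical, the final layer normalization that real implementations usually apply is omitted, and nothing here reproduces the paper's actual analysis.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden, unembed):
    """Project a hidden state (length d) through an unembedding matrix (d x vocab)."""
    vocab_size = len(unembed[0])
    return [
        sum(hidden[i] * unembed[i][v] for i in range(len(hidden)))
        for v in range(vocab_size)
    ]

# Hypothetical toy model: hidden size 2, vocabulary of 3 tokens.
unembed = [
    [1.0, 0.0, -1.0],
    [0.0, 1.0, 0.5],
]
# Hypothetical hidden states for one position at three successive layers.
hidden_by_layer = [[0.1, 0.2], [0.5, 1.0], [2.0, 0.5]]

top_tokens = []
for layer, hidden in enumerate(hidden_by_layer):
    probs = softmax(logit_lens(hidden, unembed))
    top = max(range(len(probs)), key=probs.__getitem__)
    top_tokens.append(top)
    print(f"layer {layer}: top token {top}, probs {[round(p, 3) for p in probs]}")
```

Tracking how the per-layer distribution moves is one way a logit shift between layers, or between prompt conditions, could be quantified.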


This review was generated by AI and reviewed by a human editor.