QUICK REVIEW

[Paper Review] LLMs Explain't: A Post-Mortem on Semantic Interpretability in Transformer Models

Alhassan Abdelhalim, Janick Edinger | arXiv (Cornell University) | 2026-01-30

Explainable Artificial Intelligence (XAI) | Citations: 0

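One-Line Summary

This paper critically examines two popular LLM interpretability methods, attention-based token relations and embedding-based property inference, and finds that methodological artifacts and dataset structure can produce convincing results even when the underlying semantic explanation fails.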

ABSTRACT

Large Language Models (LLMs) are becoming increasingly popular in pervasive computing due to their versatility and strong performance. However, despite their ubiquitous use, the exact mechanisms underlying their outstanding performance remain unclear. Different methods for LLM explainability exist, and many are, as a method, not fully understood themselves. We started with the question of how linguistic abstraction emerges in LLMs, aiming to detect it across different LLM modules (attention heads and input embeddings). For this, we used methods well-established in the literature: (1) probing for token-level relational structures, and (2) feature-mapping using embeddings as carriers of human-interpretable properties. Both attempts failed for different methodological reasons: Attention-based explanations collapsed once we tested the core assumption that later-layer representations still correspond to tokens. Property-inference methods applied to embeddings also failed because their high predictive scores were driven by methodological artifacts and dataset structure rather than meaningful semantic knowledge. These failures matter because both techniques are widely treated as evidence for what LLMs supposedly understand, yet our results show such conclusions are unwarranted. These limitations are particularly relevant in pervasive and distributed computing settings where LLMs are deployed as system components and interpretability methods are relied upon for debugging, compression, and explaining models.

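Motivation and Objectives

Evaluate whether attention-based relational explanations genuinely reveal token-level meaning across transformer layers.
Evaluate whether embedding-based property inference reliably decodes human-interpretable semantic properties from embeddings.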

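Proposed Method

Reproduce a standard attention-based explanation pipeline and test its token-continuity and information-flow assumptions.
Reproduce the embedding-to-feature-norm mapping using two common models (PLSR and FFNN) and standard feature-norm datasets (McRae, Buchanan, Binder).
Introduce controlled ablations and sanity checks (random/shuffled features, upper-bound mappings, taxonomic distortions) to test the underlying assumptions.
Evaluate with the metrics used in the literature (attention visualizations; F1@10, Spearman's rho, Neighborhood Accuracy@10) and report where the assumptions fail.
Provide methodological guidance, grounded in negative results, for avoiding over-interpretation of explainability outputs.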
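The shuffled-feature sanity check and the F1@10 metric listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `f1_at_k` follows one common reading of F1@10 (compare the k highest-scoring predicted features with the gold feature set), `shuffle_norms` is a hypothetical helper, and the concepts and features are toy data.

```python
import random

def f1_at_k(scores, gold, k=10):
    """F1 between the k highest-scoring predicted features and the gold set.

    Assumed definition; the paper's exact formulation may differ.
    """
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    hits = len(set(top_k) & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)

def shuffle_norms(norms, seed=0):
    """Randomly permute which feature set belongs to which concept.

    Sparsity and feature frequencies are preserved; only the link between
    a concept and its own features is broken.
    """
    rng = random.Random(seed)
    concepts = sorted(norms)
    feature_sets = [norms[c] for c in concepts]
    rng.shuffle(feature_sets)
    return dict(zip(concepts, feature_sets))

# Toy feature norms (hypothetical, for illustration only).
norms = {
    "dog": {"barks", "has_fur", "has_tail"},
    "cat": {"meows", "has_fur", "has_tail"},
    "car": {"has_wheels", "has_engine"},
}

# Toy predicted feature scores for the concept "dog".
dog_scores = {"barks": 0.9, "has_fur": 0.8, "has_tail": 0.7, "has_wheels": 0.1}

real_f1 = f1_at_k(dog_scores, norms["dog"], k=3)
shuffled = shuffle_norms(norms, seed=1)
control_f1 = f1_at_k(dog_scores, shuffled["dog"], k=3)
print(real_f1, control_f1)
```

If a mapping model trained on the shuffled norms still scores well above a random baseline, the score reflects dataset structure at least in part, which is the kind of check the review describes.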


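Experimental Results

Research Questions

RQ1: Do later-layer transformer representations preserve token identity, so that they can support token-level relational explanations?
RQ2: Do embedding spaces encode human-interpretable semantic properties well enough to be reliably decoded by standard mapping methods?
RQ3: To what extent do methodological artifacts (e.g., sparsity, upper bounds, geometric clustering) affect interpretability scores?
RQ4: How do robustness controls and ablations affect the explanatory power of attention-based and embedding-based approaches?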

Key Results

Norms	System	Upper bound	Shuffled	Shuffled upper bound	Random
McRae (F1@10)	0.25	0.27	0.10	0.13	0.01
Buchanan (F1@10)	0.18	0.22	0.06	0.11	0.01
Binder (rho)	0.74	0.90	0.30	0.59	0.01
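To read the table, it can help to express each shuffled score as a fraction of its unshuffled counterpart. The sketch below does that arithmetic on the values copied from the table. The ratios are derived here and are not reported in the review, and the dictionary keys are an assumed reading of the column headers.

```python
# Scores copied from the table above; keys are an assumed reading of the headers.
results = {
    "McRae (F1@10)": {"system": 0.25, "upper": 0.27, "shuffled": 0.10,
                      "shuffled_upper": 0.13, "random": 0.01},
    "Buchanan (F1@10)": {"system": 0.18, "upper": 0.22, "shuffled": 0.06,
                         "shuffled_upper": 0.11, "random": 0.01},
    "Binder (rho)": {"system": 0.74, "upper": 0.90, "shuffled": 0.30,
                     "shuffled_upper": 0.59, "random": 0.01},
}

def retained(row):
    """Fraction of the system score that remains in the shuffled condition."""
    return row["shuffled"] / row["system"]

def retained_upper(row):
    """Fraction of the upper-bound score that remains after shuffling."""
    return row["shuffled_upper"] / row["upper"]

for name, row in results.items():
    print(f"{name}: shuffled/system = {retained(row):.2f}, "
          f"shuffled upper/upper = {retained_upper(row):.2f}")
```

On these numbers, the shuffled condition keeps roughly a third to two fifths of the system score, and the shuffled upper bound keeps about half to two thirds of the upper bound, while the random column stays at 0.01 throughout.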

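Attention-based explanations break down because representations mix across positions and lose token identity in deeper layers.
Attention visualizations retain apparent structure even when token identity is corrupted, which challenges their use as genuine relational explanations.
Embedding-based property inference still achieves high predictive scores on shuffled, corrupted, or random features, which indicates that the scores are driven by dataset geometry and sparsity rather than semantic content.
Neighborhood analysis shows that the method captures only geometric similarity, not genuine semantic decoding.
The methodological upper bounds and ablations suggest that many interpretability claims are artifacts of the data and pipeline rather than evidence of internal semantic knowledge.
The negative results underscore the need for explicit assumption validation in interpretability research, especially when models are deployed in pervasive and edge computing contexts.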
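One way to probe the token-identity assumption behind the attention findings is to ask, for each position, whether a later-layer vector is still closest to its own input vector. The sketch below is a hypothetical check on toy vectors, not the test used in the paper. It assumes input and layer vectors live in a comparable space, and `identity_retention` is an illustrative name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def identity_retention(input_vecs, layer_vecs):
    """Share of positions whose layer vector is most similar to its own input vector."""
    kept = 0
    for i, h in enumerate(layer_vecs):
        sims = [cosine(h, e) for e in input_vecs]
        nearest = max(range(len(sims)), key=sims.__getitem__)
        if nearest == i:
            kept += 1
    return kept / len(layer_vecs)

# Toy data (hypothetical): three positions, three dimensions.
input_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
early_layer = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
late_layer = [[0.2, 0.7, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]

print(identity_retention(input_vecs, early_layer))
print(identity_retention(input_vecs, late_layer))
```

In this toy case every early-layer position still maps back to its own input, while only one of three late-layer positions does. With real hidden states, a low retention value would signal that reading an attention map as a relation between the original tokens needs extra justification.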


This review was generated by AI and reviewed by a human editor.