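[Paper Digest] LLMs Explain't: A Post-Mortem on Semantic Interpretability in Transformer Models
This paper critically tests two popular interpretability methods for large language models: attention-based token relations and embedding-based property inference. It finds that both can produce convincing results even when the underlying semantic interpretation fails, because the scores are driven by methodological artifacts and dataset structure.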
Large Language Models (LLMs) are becoming increasingly popular in pervasive computing due to their versatility and strong performance. However, despite their ubiquitous use, the exact mechanisms underlying their outstanding performance remain unclear. Different methods for LLM explainability exist, and many are, as a method, not fully understood themselves. We started with the question of how linguistic abstraction emerges in LLMs, aiming to detect it across different LLM modules (attention heads and input embeddings). For this, we used methods well-established in the literature: (1) probing for token-level relational structures, and (2) feature-mapping using embeddings as carriers of human-interpretable properties. Both attempts failed for different methodological reasons: Attention-based explanations collapsed once we tested the core assumption that later-layer representations still correspond to tokens. Property-inference methods applied to embeddings also failed because their high predictive scores were driven by methodological artifacts and dataset structure rather than meaningful semantic knowledge. These failures matter because both techniques are widely treated as evidence for what LLMs supposedly understand, yet our results show such conclusions are unwarranted. These limitations are particularly relevant in pervasive and distributed computing settings where LLMs are deployed as system components and interpretability methods are relied upon for debugging, compression, and explaining models.
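Motivation and Objectives
- Assess whether attention-based relational explanations genuinely reveal token-level semantics across transformer layers
- Assess whether human-interpretable semantic features can be reliably decoded from embeddings
Proposed Method
- Reproduce the standard attention-based explanation pipeline and test its token-continuity and information-flow assumptions
- Reproduce the embedding-to-feature-norm mapping with two commonly used models (PLSR and FFNN) on standard feature-norm datasets (McRae, Buchanan, Binder)
- Introduce controlled ablations and sanity checks (random/shuffled features, upper-bound mappings, disruption of category structure) to challenge the underlying assumptions
- Evaluate with the metrics used in the literature (attention visualization; F1@10, Spearman's correlation, Neighborhood Accuracy@10) and report where the assumptions break down
- Use the negative results to offer methodological guidance and to avoid over-interpreting interpretability outputs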
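The metric and the feature controls listed above can be sketched in a few lines. This is a minimal illustration, not the paper's code. `f1_at_k` assumes F1@10 compares the ten highest-scoring predicted features with a concept's gold feature set, `shuffle_targets` assumes the shuffled control permutes which concept each feature vector belongs to, and `random_targets` assumes the random control draws binary features at the dataset's overall density. All three names and definitions are hypothetical.

```python
import random


def f1_at_k(scores, gold, k=10):
    """F1 between the k highest-scoring feature indices and the gold index set.

    Assumed definition for illustration; the paper may compute F1@10 differently.
    """
    if not gold:
        return 0.0
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = len(set(ranked[:k]) & set(gold))
    if hits == 0:
        return 0.0
    precision = hits / k
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def shuffle_targets(feature_rows, seed=0):
    """Shuffled control (assumed form): reassign feature vectors to concepts at random.

    The embeddings keep their order, so the concept-to-feature link is broken
    while sparsity and feature frequencies stay exactly the same.
    """
    rng = random.Random(seed)
    shuffled = list(feature_rows)  # copy, so the real targets are left untouched
    rng.shuffle(shuffled)
    return shuffled


def random_targets(feature_rows, seed=0):
    """Random control (assumed form): binary features drawn at the overall density."""
    rng = random.Random(seed)
    total = sum(len(row) for row in feature_rows)
    if total == 0:
        return [list(row) for row in feature_rows]
    density = sum(sum(row) for row in feature_rows) / total
    return [[1 if rng.random() < density else 0 for _ in row]
            for row in feature_rows]
```

In such a setup the same mapping model would be trained once on the real targets and once on each control. A control score that approaches the real score suggests the metric is rewarding dataset structure rather than decoded semantics.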

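Experimental Results
Research Questions
- RQ1: Do later-layer transformer representations preserve token identity in a way that supports token-level relational explanations?
- RQ2: Do embedding spaces encode human-interpretable semantic properties in a way that standard mapping methods can reliably decode?
- RQ3: To what extent do methodological artifacts (such as sparsity, upper bounds, and geometric clustering) drive interpretability scores?
- RQ4: How do robustness controls and ablations affect the perceived explanatory power of attention-based and embedding-based methods?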
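To make RQ3 concrete, the sketch below shows how skewed feature frequencies alone can produce a respectable top-k F1. It uses a baseline that never looks at an embedding and always predicts the most frequent training features. The concepts, feature names, and the `frequency_baseline` helper are invented for illustration; this is not one of the paper's controls.

```python
from collections import Counter


def top_k_f1(predicted, gold, k):
    """F1 between the first k predicted features and the gold feature set."""
    hits = len(set(predicted[:k]) & gold)
    if hits == 0:
        return 0.0
    precision = hits / k
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def frequency_baseline(train_gold, k):
    """Return the k most frequent training features; the embedding is ignored."""
    counts = Counter(f for gold in train_gold for f in gold)
    return [feature for feature, _ in counts.most_common(k)]


# Hypothetical toy feature norms: a few features dominate the dataset.
train = {
    "dog": {"is_animal", "has_legs", "has_fur"},
    "cat": {"is_animal", "has_legs", "has_fur"},
    "horse": {"is_animal", "has_legs", "is_large"},
    "hammer": {"is_tool", "made_of_metal"},
}
held_out = {"cow": {"is_animal", "has_legs", "is_large"}}

prediction = frequency_baseline(train.values(), k=2)
score = top_k_f1(prediction, held_out["cow"], k=2)
print(sorted(prediction), round(score, 2))
```

Here the embedding-blind baseline reaches an F1 of 0.8 on the held-out concept, because the two most frequent features happen to apply to it. A real evaluation would report such a baseline next to the mapping model so the two can be compared.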
Key Findings
| Norm | Sys | Upper | Shuffle | Shuf-Upper | Rand |
|---|---|---|---|---|---|
| McRae (F1@10) | 0.25 | 0.27 | 0.10 | 0.13 | 0.01 |
| Buchanan (F1@10) | 0.18 | 0.22 | 0.06 | 0.11 | 0.01 |
| Binder (rho) | 0.74 | 0.90 | 0.30 | 0.59 | 0.01 |
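- Attention-based explanations lose token identity in deeper layers, because each position's representation becomes a mixture of upstream positions
- Attention visualizations keep their apparent structure even after token identity has been destroyed, which undermines their use as genuine relational explanations
- Embedding-based property inference can yield high predictive scores even on shuffled, corrupted, or random features; the scores are driven by dataset geometry and sparsity rather than semantic content
- Neighborhood analysis shows that the method mainly captures geometric similarity rather than genuine semantic decoding
- Upper-bound mappings and ablations reveal that many interpretability claims are artifacts of the data and the pipeline, not evidence of internal semantic knowledge
- The negative results underline the need for explicit assumption testing in interpretability work, especially for deployments in pervasive and edge computing settings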
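The first finding can be illustrated with a linear proxy in the style of attention rollout: each layer is treated as an equal blend of the residual stream (identity) and the attention matrix, and the layers are multiplied together. This is a toy sketch under strong assumptions (uniform attention, no value projections, MLPs, or layer norms), not the test the paper ran. The `rollout` and `self_share` names are hypothetical.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def rollout(attn_layers, residual=0.5):
    """Compose per-layer mixing matrices, treating the residual stream as identity.

    attn_layers: list of row-stochastic n x n attention matrices, first layer first.
    Entry [i][j] of the result is the share of position i's final representation
    attributed to input token j under this linear proxy.
    """
    n = len(attn_layers[0])
    mix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in attn_layers:
        layer = [[residual * (1.0 if i == j else 0.0)
                  + (1.0 - residual) * attn[i][j]
                  for j in range(n)] for i in range(n)]
        mix = matmul(layer, mix)  # later layers are applied on top of earlier ones
    return mix


def self_share(mix):
    """Share of each position's representation that still traces to its own token."""
    return [mix[i][i] for i in range(len(mix))]


n_tokens = 4
uniform = [[1.0 / n_tokens] * n_tokens for _ in range(n_tokens)]
for depth in (1, 2, 6):
    shares = self_share(rollout([uniform] * depth))
    print(depth, round(shares[0], 4))
```

With uniform attention over four tokens, the share of a position that still traces back to its own input token falls from 0.625 after one layer toward the uniform limit of 0.25. At that depth, reading a row of the attention map as a statement about "the token at position i" is already questionable.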

This digest was generated by AI and reviewed by human editors.