QUICK REVIEW

[Paper Review] Open Problems in Mechanistic Interpretability

Lee Sharkey, Bilal Chughtai | ArXiv.org | Jan 27, 2025

Natural Language Processing Techniques | Cited by 5

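One-Sentence Summary

A forward-looking review that outlines the open problems in mechanistic interpretability across methodology, applications, and socio-technical issues, with emphasis on reverse engineering, concept-based methods, and pipeline automation.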

ABSTRACT

Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.

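Research Motivation and Goals

Clarify the goals of mechanistic interpretability in understanding how neural networks generalize.
Survey current methods (reverse engineering and concept-based interpretability) and their open problems.
Identify practical steps toward proceduralizing circuit discovery and automating interpretability research.
Discuss application-oriented goals such as monitoring for safety, controlling behavior, and predicting model capabilities.
Address socio-technical and governance issues related to mechanistic interpretability.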

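Proposed Methods

Frame reverse engineering as identifying the roles of network components through decomposition, description, and validation.
Frame concept-based interpretability as identifying the components that fill a given role, using concepts and probes.
Evaluate decomposition methods (dimensionality reduction and sparse dictionary learning, SDL) and their limitations.
Critically analyze the linear representation hypothesis and the validity of sparsity as a proxy for interpretability.
Propose proceduralizing mechanistic interpretability into circuit discovery pipelines and routes to automation.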
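
To make the concept-based approach above concrete, here is a minimal sketch of a linear probe: a logistic-regression classifier trained to read a concept off activations. Everything in it is an illustrative assumption rather than the paper's procedure: the "activations" are synthetic, the `concept_direction` vector is hypothetical, and the helper names are invented for this example.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def make_activations(n, direction, rng):
    """Synthetic stand-in for model activations (an assumption):
    Gaussian noise shifted along +/- direction depending on the concept label."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label else -1.0
        data.append(([rng.gauss(0.0, 1.0) + sign * c for c in direction], label))
    return data

def train_probe(data, dim, lr=0.05, epochs=20):
    """Logistic-regression probe fitted with plain stochastic gradient descent."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def accuracy(data, w, b):
    hits = sum(
        int((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1))
        for x, y in data
    )
    return hits / len(data)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

rng = random.Random(0)
# Hypothetical direction along which the concept is linearly encoded.
concept_direction = [1.5, -1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
train = make_activations(400, concept_direction, rng)
test = make_activations(200, concept_direction, rng)

w, b = train_probe(train, dim=len(concept_direction))
test_acc = accuracy(test, w, b)
alignment = cosine(w, concept_direction)
print(f"held-out probe accuracy: {test_acc:.3f}")
print(f"cosine(probe weights, planted direction): {alignment:.3f}")
```

In this toy setting the probe can succeed because the concept was planted linearly. On a real model, high probe accuracy only shows that the concept is decodable from the activations; it does not by itself show that the network uses that direction in its computation.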

Figure 1: Two approaches to neural network interpretability. (Left) Reverse Engineering is characterized by decomposing networks into functional components and describing how those components interact to produce the network’s behavior. It thus aims to ‘identify the roles of network components’ ( Se

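Experimental Results

Research Questions

RQ1: What are the main open problems in the methods and foundations for identifying the roles of network components?
RQ2: What limitations do concept-based probes have in reliably identifying the network components that correspond to a specified concept?
RQ3: How can mechanistic interpretability be proceduralized into circuit discovery pipelines and automated workflows?
RQ4: What are the key challenges and opportunities in applying mechanistic interpretability to monitoring, controlling, and predicting AI systems?
RQ5: What socio-technical and governance questions arise as mechanistic interpretability develops?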

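Key Findings

SDL is the most popular unsupervised decomposition method, but it has significant practical and conceptual limitations.
Many decompositions rely on the linear representation hypothesis, which does not hold universally across models.
SDL uses sparsity as a proxy for interpretability, but this does not always hold because of feature splitting, absorption, and composition.
Current decomposition methods do not directly reveal underlying mechanisms; they identify activations rather than precise mechanisms.
Representations may be distributed across architectural components beyond individual neurons or layers, which further complicates decomposition.
Improved theoretical foundations and scalable, architecture-aware decomposition methods are needed.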
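
The findings above about sparsity as a proxy can be illustrated with a toy sparse autoencoder, one common way to do sparse dictionary learning. This is a sketch under stated assumptions, not the paper's method: the data are synthetic sparse mixtures of hypothetical ground-truth features, the objective is squared reconstruction error plus an L1 penalty on the latents, and all names and hyperparameters (`l1_coeff`, `lr`, the dimensions) are chosen only for illustration.

```python
import random

def encode(W_enc, b_enc, x):
    """Latents f = ReLU(W_enc x + b_enc)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W_enc, b_enc)]

def decode(W_dec, f):
    """Reconstruction x_hat = W_dec f (one row per input dim, one column per latent)."""
    return [sum(w * fj for w, fj in zip(row, f)) for row in W_dec]

def mean_loss(data, W_enc, b_enc, W_dec, l1_coeff):
    total = 0.0
    for x in data:
        f = encode(W_enc, b_enc, x)
        x_hat = decode(W_dec, f)
        recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        total += recon + l1_coeff * sum(f)  # f >= 0, so sum(f) is its L1 norm
    return total / len(data)

def sgd_step(x, W_enc, b_enc, W_dec, l1_coeff, lr):
    f = encode(W_enc, b_enc, x)
    resid = [xh - xi for xh, xi in zip(decode(W_dec, f), x)]
    for j, fj in enumerate(f):
        # Gradient w.r.t. latent j, computed before decoder column j is updated.
        grad_f = 2.0 * sum(resid[i] * W_dec[i][j] for i in range(len(x))) + l1_coeff
        for i in range(len(x)):
            W_dec[i][j] -= lr * 2.0 * resid[i] * fj
        if fj > 0.0:  # ReLU passes gradient only through active latents
            for k in range(len(x)):
                W_enc[j][k] -= lr * grad_f * x[k]
            b_enc[j] -= lr * grad_f

rng = random.Random(0)
dim, n_true, n_latents = 4, 6, 8  # more true features than dimensions (toy superposition)
true_feats = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_true)]
true_feats = [[a / sum(b * b for b in v) ** 0.5 for a in v] for v in true_feats]

data = []
for _ in range(200):
    x = [0.0] * dim
    for idx in rng.sample(range(n_true), rng.randint(1, 2)):  # 1-2 active features
        coeff = rng.uniform(0.5, 1.5)
        x = [xi + coeff * t for xi, t in zip(x, true_feats[idx])]
    data.append(x)

W_enc = [[rng.gauss(0.0, 0.3) for _ in range(dim)] for _ in range(n_latents)]
b_enc = [0.0] * n_latents
W_dec = [[W_enc[j][i] for j in range(n_latents)] for i in range(dim)]  # transpose init

l1_coeff, lr = 0.05, 0.01
initial_loss = mean_loss(data, W_enc, b_enc, W_dec, l1_coeff)
for _ in range(50):
    for x in data:
        sgd_step(x, W_enc, b_enc, W_dec, l1_coeff, lr)
final_loss = mean_loss(data, W_enc, b_enc, W_dec, l1_coeff)
mean_active = sum(sum(1 for a in encode(W_enc, b_enc, x) if a > 0.0)
                  for x in data) / len(data)
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
print(f"mean active latents per input: {mean_active:.2f} of {n_latents}")
```

The objective rewards only reconstruction and sparsity. Nothing in it requires a learned latent to match one of the planted features, which is one way to see why sparsity may be an imperfect proxy for interpretability.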

Figure 2: The steps of reverse engineering neural networks. (1) Decomposing a network into simpler components. This decomposition might not necessarily use architecturally-defined bases, such as individual neurons or layers (Section 2.1.2). (2) Hypothesizing about the functional roles of some o


This review was generated by AI and reviewed by a human editor.