QUICK REVIEW

[Paper Review] Finding Neurons in a Haystack: Case Studies with Sparse Probing

Wes Gurnee, Neel Nanda | arXiv (Cornell University) | May 2, 2023

Topic Modeling | Cited by 13

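One-Sentence Summary

This paper proposes sparse probing to locate individual neurons, or sparse subsets of neurons, that encode human-interpretable features in LLM activations; it analyzes how these representations change with model scale and presents case studies across multiple models and features.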

ABSTRACT

Despite rapid adoption and deployment of large language models (LLMs), the internal computations of these models remain opaque and poorly understood. In this work, we seek to understand how high-level human-interpretable features are represented within the internal neuron activations of LLMs. We train $k$-sparse linear classifiers (probes) on these internal activations to predict the presence of features in the input; by varying the value of $k$ we study the sparsity of learned representations and how this varies with model scale. With $k=1$, we localize individual neurons which are highly relevant for a particular feature, and perform a number of case studies to illustrate general properties of LLMs. In particular, we show that early layers make use of sparse combinations of neurons to represent many features in superposition, that middle layers have seemingly dedicated neurons to represent higher-level contextual features, and that increasing scale causes representational sparsity to increase on average, but there are multiple types of scaling dynamics. In all, we probe for over 100 unique features comprising 10 different categories in 7 different models spanning 70 million to 6.9 billion parameters.
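
As a compact way to read the setup described in the abstract, a $k$-sparse linear probe can be written as a cardinality-constrained classification problem. The notation below (activations $x_i \in \mathbb{R}^d$, binary labels $y_i$, a generic classification loss $\ell$) is assumed here for illustration and is not taken from the paper:

$$
\min_{w \in \mathbb{R}^d,\; b \in \mathbb{R}} \; \sum_{i=1}^{n} \ell\big(y_i,\; w^\top x_i + b\big) \quad \text{subject to} \quad \|w\|_0 \le k
$$

Here $\|w\|_0$ counts the nonzero entries of $w$, so the probe may read at most $k$ neurons. With $k=1$ the probe reduces to a threshold on a single neuron's activation, which is what makes it usable for localizing individual neurons.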

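Research Motivation and Objectives

Study how high-level, human-interpretable features are represented in the internal neuron activations of LLMs.
Develop and apply k-sparse probes to locate the neurons responsible for specific features.
Examine how feature representations vary with model scale and layer depth, revealing sparsity dynamics.
Provide case studies that illustrate superposition, monosemanticity, and the effect of scale on representations.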

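Proposed Method

Train k-sparse linear classifiers (probes) on internal activations to predict features of the input.
Use adaptive thresholding and optimal sparse probing (OSP) to select the top k neurons with high predictive power.
Compare probing methods and evaluate precision, recall, F1, and Matthews correlation coefficient (MCC) on held-out data.
Analyze MLP-layer activations in the transformer (MLP layers hold roughly 2/3 of the parameters) to locate feature-relevant neurons.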
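
The method steps above can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the paper's implementation: neurons are ranked by a standardized class-mean difference (a simple stand-in for the selection methods named above), the probe is a plain gradient-descent logistic regression, and the "activations" are random numbers in which a hypothetical neuron 7 is made to carry the feature. All function and variable names are made up for this example.

```python
import numpy as np

def select_top_k(X, y, k):
    """Rank neurons by standardized class-mean difference; return top-k indices."""
    mu_pos = X[y == 1].mean(axis=0)
    mu_neg = X[y == 0].mean(axis=0)
    score = np.abs(mu_pos - mu_neg) / (X.std(axis=0) + 1e-8)
    return np.argsort(score)[::-1][:k]

def fit_logistic(Z, y, lr=0.1, steps=500):
    """Logistic regression by batch gradient descent on the selected neurons."""
    w, b = np.zeros(Z.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
        grad = p - y
        w -= lr * (Z.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and MCC from the confusion-matrix counts."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}

# Synthetic "activations": 64 neurons, with neuron 7 (an assumption) encoding the feature.
rng = np.random.default_rng(0)
n, d, k = 2000, 64, 1
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, 7] += 3.0 * y

# Held-out split: fit on the first 1500 examples, evaluate on the rest.
X_tr, y_tr, X_te, y_te = X[:1500], y[:1500], X[1500:], y[1500:]

neurons = select_top_k(X_tr, y_tr, k)
mean, std = X_tr[:, neurons].mean(axis=0), X_tr[:, neurons].std(axis=0)
w, b = fit_logistic((X_tr[:, neurons] - mean) / std, y_tr)

pred = (((X_te[:, neurons] - mean) / std) @ w + b > 0).astype(int)
metrics = binary_metrics(y_te, pred)
print("selected neurons:", neurons.tolist())
print(metrics)
```

Standardizing the selected activations before fitting keeps the gradient-descent step size reasonable; any off-the-shelf logistic regression could replace `fit_logistic` without changing the idea.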

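Experimental Results

Research Questions

RQ1: Are high-level features in LLMs represented mostly by individual neurons, or jointly by sparse combinations of neurons (superposition)?
RQ2: How do representational sparsity and the location of feature-aligned neurons vary with model scale and layer depth?
RQ3: Do middle-layer neurons tend to be monosemantic for high-level contextual features, while early layers show more superposition?
RQ4: Across probes covering many feature types, how reliable and interpretable are the feature-neuron pairs that sparse probing identifies?
RQ5: What methodological considerations and potential confounds arise when probing neuron representations?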

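Key Findings

Both monosemantic neurons and polysemantic neurons that participate in multi-feature superposition exist.
Early layers represent many features in superposition using sparse combinations of neurons, whereas middle layers contain neurons that appear dedicated to higher-level contextual features.
As model scale increases, representational sparsity increases on average, but different feature types follow different scaling dynamics.
Larger scale yields finer-grained representations for some features, and may reduce sparsity for others through feature splitting or dedicated circuits.
The study probes features across 7 models ranging from 70M to 6.9B parameters and provides evidence for both the existence and the dynamics of sparse representations.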
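
One way to make "representational sparsity" in these findings concrete is to sweep $k$ and watch how probe performance changes: a feature carried by one neuron should saturate at $k=1$, while a feature spread over several neurons should keep improving as $k$ grows. The sketch below illustrates this on synthetic data only. The neuron indices, shift sizes, and the class-mean linear classifier are all assumptions chosen for brevity, and accuracy is used in place of the metrics listed in the method section.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4000, 64  # examples and neurons in the synthetic "layer"

def make_data(signal_neurons, shift):
    """Gaussian activations where the listed neurons shift when the feature is on."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d))
    X[:, signal_neurons] += shift * y[:, None]
    return X, y

def probe_accuracy(X, y, k, n_train=3000):
    """Pick the top-k neurons on the training split, then classify held-out data."""
    X_tr, y_tr, X_te, y_te = X[:n_train], y[:n_train], X[n_train:], y[n_train:]
    mu_pos = X_tr[y_tr == 1].mean(axis=0)
    mu_neg = X_tr[y_tr == 0].mean(axis=0)
    score = np.abs(mu_pos - mu_neg) / (X_tr.std(axis=0) + 1e-8)
    idx = np.argsort(score)[::-1][:k]
    # Linear rule: project onto the class-mean difference, threshold at the midpoint.
    w = mu_pos[idx] - mu_neg[idx]
    b = -0.5 * (mu_pos[idx] + mu_neg[idx]) @ w
    pred = (X_te[:, idx] @ w + b > 0).astype(int)
    return float((pred == y_te).mean())

ks = [1, 2, 4, 8, 16]
# Feature A: carried by a single neuron with a strong signal.
X_sparse, y_sparse = make_data([5], shift=4.0)
# Feature B: spread weakly over eight neurons.
X_dist, y_dist = make_data(list(range(10, 18)), shift=1.0)

acc_sparse = {k: probe_accuracy(X_sparse, y_sparse, k) for k in ks}
acc_dist = {k: probe_accuracy(X_dist, y_dist, k) for k in ks}

for k in ks:
    print(f"k={k:2d}  single-neuron feature: {acc_sparse[k]:.3f}  distributed feature: {acc_dist[k]:.3f}")
```

Comparing the two curves, rather than a single score, is what separates a feature that looks sparse from one that needs many neurons to decode.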

Figure (b): A polysemantic neuron activating on six unrelated $n$-grams.

This review was generated by AI and reviewed by human editors.