QUICK REVIEW

[Paper Review] Language Model Circuits Are Sparse in the Neuron Basis

Aryaman Arora, Zhengxuan Wu|arXiv (Cornell University)|Jan 30, 2026

Explainable Artificial Intelligence (XAI)|Citations: 0

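One-sentence summary

This paper shows that MLP activations (the neuron basis) yield sparser circuits than MLP outputs while remaining faithful, uses RelP attribution to identify causal circuitry, and reports circuit-tracing results consistent with SAE-based methods.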

ABSTRACT

The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques such as \textit{sparse autoencoders} (SAEs) to decompose the neuron basis into more interpretable units of model computation, for tasks such as \textit{circuit tracing}. However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that \textbf{MLP neurons are as sparse a feature basis as SAEs}. We use this finding to develop an end-to-end pipeline for circuit tracing on the MLP neuron basis, which locates causal circuitry on a variety of tasks using gradient-based attribution. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^2$ MLP neurons is enough to control model behaviour. On the multi-hop city $\to$ state $\to$ capital task from Lindsey et al., 2025, we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g.~`map city to its state'), and can be steered to change the model's output. This work thus advances automated interpretability of language models without additional training costs.

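Motivation and objectives

Investigate whether neuron-basis representations (MLP activations) can yield circuits that are as sparse and faithful as those built from SAEs.
Develop an end-to-end circuit-tracing pipeline based on neuron activations and gradient-based attribution.
Compare neuron-basis circuits with SAE-based circuits on standard benchmarks and in unpaired-data settings.
Demonstrate the usefulness of neuron-basis circuit tracing on tasks such as subject-verb agreement and multi-hop reasoning in Llama 3.1 8B Instruct.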

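Proposed method

Circuit nodes are represented as MLP activations, MLP outputs, attention outputs, the residual stream, and SAE features.
Node importance is scored with Integrated Gradients (IG) and RelP, a single-pass gradient-based attribution method. RelP replaces nonlinearities with linear approximations to obtain faithful attributions.
Circuit faithfulness and completeness are evaluated by mean-ablating the complement of the circuit and are normalized against baselines.
Sparse circuits are formed by greedily selecting the highest-contributing nodes while varying the circuit size k.
Llama Scope's 8x-width SAEs are reproduced for cross-basis comparison.
RelP is applied to both neuron-level node attribution and edge-level attribution, and an edge-flow normalization metric is also included.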
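
As a rough illustration of the IG scoring step listed above, the sketch below computes Integrated Gradients on a hypothetical two-variable toy metric `f` with a hand-written gradient, and compares it with a single-gradient score. The function `f`, the zero baseline, and the step count are all assumptions made for illustration. This is not the paper's implementation, and the single-gradient score is plain gradient times input difference, not RelP.

```python
def f(x):
    """Hypothetical toy metric: f(x) = x0^2 + 3*x0*x1."""
    return x[0] ** 2 + 3.0 * x[0] * x[1]


def grad_f(x):
    """Analytic gradient of the toy metric f."""
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]


def integrated_gradients(x, baseline, grad, steps=50):
    """Approximate IG with a midpoint Riemann sum along the straight path
    from `baseline` to `x`. Costs one gradient evaluation per step."""
    total = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(point)
        for i in range(len(x)):
            total[i] += g[i]
    # Average gradient along the path, scaled by the input difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def gradient_times_input(x, baseline, grad):
    """Single-gradient score: gradient at x times (x - baseline).
    Shown only for contrast; this is NOT RelP."""
    g = grad(x)
    return [(xi - b) * gi for xi, b, gi in zip(x, baseline, g)]


x, baseline = [1.0, 2.0], [0.0, 0.0]
ig_scores = integrated_gradients(x, baseline, grad_f)
one_step_scores = gradient_times_input(x, baseline, grad_f)
print("IG:", ig_scores, "sum:", sum(ig_scores))
print("gradient x input:", one_step_scores, "sum:", sum(one_step_scores))
print("f(x) - f(baseline):", f(x) - f(baseline))
```

On this toy metric the IG scores sum to f(x) - f(baseline) up to floating-point error, while the single-gradient scores do not. IG pays for that with one gradient evaluation per interpolation step, whereas the review describes RelP as needing a single pass.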

Figure 1 : Faithfulness and completeness for different choices of representation in the model (residual stream, attention, MLP activations, or MLP outputs) and basis (neurons or SAE) when applying Integrated Gradients, averaged over the 4 SVA tasks with paired data.
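
The faithfulness and completeness scores in this figure can be made concrete with a small sketch. It assumes the normalization convention of Marks et al. (2025): faithfulness is $(m(C) - m(\emptyset)) / (m(M) - m(\emptyset))$, where $m(C)$ is the metric with everything outside circuit $C$ mean-ablated, $m(M)$ is the full model, and $m(\emptyset)$ is the fully mean-ablated model; completeness is taken as the faithfulness of the complement of $C$. The review itself only says the scores are "normalized against baselines", so this formula is an assumption. The one-layer ReLU network, its weights, the reference inputs, and the first-order attribution score are hypothetical stand-ins, not the paper's setup.

```python
def activations(w_in, x):
    """Hidden MLP activations a_j = relu(w_in[j] . x)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]


def metric(w_out, acts):
    """Scalar output metric, linear in the hidden activations."""
    return sum(w * a for w, a in zip(w_out, acts))


def mean_activations(w_in, inputs):
    """Per-neuron mean activation over a set of reference inputs."""
    all_acts = [activations(w_in, x) for x in inputs]
    return [sum(col) / len(all_acts) for col in zip(*all_acts)]


def faithfulness(w_out, acts, means, circuit):
    """(m(C) - m(empty)) / (m(M) - m(empty)) with mean ablation."""
    kept = [a if j in circuit else mu
            for j, (a, mu) in enumerate(zip(acts, means))]
    m_empty = metric(w_out, means)
    return (metric(w_out, kept) - m_empty) / (metric(w_out, acts) - m_empty)


def completeness(w_out, acts, means, circuit):
    """Faithfulness of the circuit's complement (assumed convention)."""
    complement = set(range(len(acts))) - set(circuit)
    return faithfulness(w_out, acts, means, complement)


def top_k_circuit(w_out, acts, means, k):
    """Greedy selection: keep the k neurons with the largest
    |d metric / d a_j * (a_j - mean_j)| (first-order score)."""
    scores = [abs(w * (a - mu)) for w, a, mu in zip(w_out, acts, means)]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


# Hypothetical toy weights and data.
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
W_OUT = [2.0, 1.0, 0.5, 0.1]
REFERENCE = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
X = [3.0, 2.0]

ACTS = activations(W_IN, X)
MEANS = mean_activations(W_IN, REFERENCE)
for k in range(len(ACTS) + 1):
    circuit = top_k_circuit(W_OUT, ACTS, MEANS, k)
    print(k, sorted(circuit),
          round(faithfulness(W_OUT, ACTS, MEANS, circuit), 3),
          round(completeness(W_OUT, ACTS, MEANS, circuit), 3))
```

Sweeping `k` traces out a faithfulness-versus-circuit-size curve of the kind the figure reports. Because the toy metric is linear in the activations, faithfulness and completeness sum to 1 here; that identity need not hold in a real language model.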

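Experimental results

Research questions

RQ1: Can MLP activation neurons provide circuits that are sparser than those from SAE-based features while remaining faithful?
RQ2: Does RelP improve faithfulness and completeness over IG in neuron-based circuit tracing?
RQ3: Do neuron-basis circuits generalize to unpaired data and reproduce findings from cross-layer transcoder (CLT) based work?
RQ4: What are the properties of edges in neuron-basis circuits, and can RelP identify more faithful edges than IG?
RQ5: Does neuron-level circuit tracing make multi-hop reasoning and the steerability of model outputs interpretable?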

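Key findings

MLP activations yield far sparser circuits than MLP outputs (roughly 100x smaller) while remaining faithful to model behaviour.
RelP narrows the gap between MLP-activation circuits and SAE circuits, reaching near-complete faithfulness with roughly 200 neurons on the SVA tasks.
RelP outperforms IG in both paired and unpaired data settings, improving faithfulness and, in some cases, completeness.
Edge attribution with RelP (with stop-gradients) strikes the best balance reported, combining high faithfulness (>80%) with a large reduction of the edge set (to roughly 10% of candidate edges).
Neuron-level circuits in Llama 3.1 8B Instruct reproduce cross-layer transcoder results, and targeting specific groups of neurons makes it possible to steer model outputs.
In a case study of multi-hop reasoning about the capital of Texas, the circuit exposes interpretable groups of neurons that correspond to prior CLT findings and enable targeted steering of the output.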
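
To make the idea of steering a latent reasoning step concrete, here is a deliberately tiny sketch of a two-hop lookup (city to state, then state to capital) in which the intermediate "state neurons" are overwritten before the second hop. The lookup tables, the one-hot activations, the example cities, and the `steer` helper are all hypothetical; the review does not say how the paper performs steering, and a real model's neurons are not one-hot.

```python
STATES = ["Texas", "California"]
CITY_TO_STATE = {"Dallas": "Texas", "Oakland": "California"}
STATE_TO_CAPITAL = {"Texas": "Austin", "California": "Sacramento"}


def state_neurons(city):
    """Hop 1 (city -> state): one-hot activations over hypothetical state neurons."""
    return [1.0 if CITY_TO_STATE[city] == s else 0.0 for s in STATES]


def read_capital(state_acts):
    """Hop 2 (state -> capital): read out from the most active state neuron."""
    best = max(range(len(STATES)), key=lambda j: state_acts[j])
    return STATE_TO_CAPITAL[STATES[best]]


def steer(state_acts, suppress, boost, scale=2.0):
    """Return a copy with neuron `suppress` zeroed and neuron `boost` set to `scale`."""
    steered = list(state_acts)
    steered[suppress] = 0.0
    steered[boost] = scale
    return steered


acts = state_neurons("Dallas")
print("unsteered:", read_capital(acts))
print("steered:  ", read_capital(steer(acts, suppress=0, boost=1)))
```

The point of the toy is only that intervening on the intermediate representation changes the final answer without touching the input, which is the kind of behaviour the case study reports for small groups of neurons.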

Figure 2 : Faithfulness and completeness for Integrated Gradients vs. RelP, for different choices of representation in the model and basis (neurons or SAE), averaged over the 4 SVA tasks with paired data.

This review was generated by AI and checked by a human editor.