QUICK REVIEW

[Paper Review] Language Model Circuits Are Sparse in the Neuron Basis

Aryaman Arora, Zhengxuan Wu|arXiv (Cornell University)|2026. 01. 30.

Explainable Artificial Intelligence (XAI) | Citations: 0

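One-line summary

The paper shows that MLP activations (the neuron basis) yield sparser and more faithful circuits than MLP outputs, and that RelP-based attribution finds causal circuits that match SAE-based methods in circuit tracing.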

ABSTRACT

The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques such as \textit{sparse autoencoders} (SAEs) to decompose the neuron basis into more interpretable units of model computation, for tasks such as \textit{circuit tracing}. However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that \textbf{MLP neurons are as sparse a feature basis as SAEs}. We use this finding to develop an end-to-end pipeline for circuit tracing on the MLP neuron basis, which locates causal circuitry on a variety of tasks using gradient-based attribution. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^2$ MLP neurons is enough to control model behaviour. On the multi-hop city $\to$ state $\to$ capital task from Lindsey et al., 2025, we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g.~`map city to its state'), and can be steered to change the model's output. This work thus advances automated interpretability of language models without additional training costs.

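Motivation and Objectives

Investigate whether the neuron basis (MLP activations) can yield circuits that are as sparse and faithful as those built on SAEs.
Develop an end-to-end circuit-tracing pipeline that locates causal circuits using neuron activations and gradient-based attribution.
Compare neuron-based and SAE-based circuits on a standard benchmark and in an unpaired-data setting.
Demonstrate the usefulness of neuron-based circuit tracing on Llama 3.1 8B Instruct for tasks such as subject-verb agreement and multi-hop reasoning.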

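Proposed Method

Represent circuit nodes as MLP activations, MLP outputs, attention outputs, the residual stream, or SAE features.
Score node importance with Integrated Gradients (IG) and RelP, a single-pass gradient-based attribution method; RelP replaces nonlinearities with linear approximations to obtain more faithful attributions.
Evaluate a circuit by mean-ablating its complement, and measure faithfulness and completeness as in prior work (normalized against baselines).
Form sparse circuits by greedily selecting the top-attributed nodes while varying the circuit size k.
For the comparison across bases, reproduce the SAE setting using the 8x-width SAEs from Llama Scope.
Apply RelP to both neuron-level node attribution and edge-level attribution, including an edge-flow normalization metric.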
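
The evaluation loop described above (attribute, greedily pick the top-k nodes, mean-ablate the complement, score faithfulness and completeness) can be sketched on a toy problem. Everything in the block below is an illustrative assumption rather than the paper's code: the "model" is a hypothetical tanh readout over 32 neuron activations, the mean baseline is a constant, and faithfulness is taken as (m(C) - m(empty)) / (m(M) - m(empty)), the normalization commonly used in prior circuit work. Completeness is computed here as the faithfulness of the circuit's complement. Only Integrated Gradients is shown; RelP's propagation rules are not reproduced.

```python
import math

def metric(acts, weights):
    """Toy stand-in for a model metric: a nonlinear readout of neuron activations."""
    return math.tanh(sum(w * a for w, a in zip(weights, acts)))

def metric_grad(acts, weights):
    """Analytic gradient of the toy metric with respect to each activation."""
    z = sum(w * a for w, a in zip(weights, acts))
    scale = 1.0 - math.tanh(z) ** 2
    return [scale * w for w in weights]

def integrated_gradients(acts, baseline, weights, steps=256):
    """Midpoint Riemann-sum approximation of IG along the baseline -> acts path."""
    avg_grad = [0.0] * len(acts)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (a - b) for a, b in zip(acts, baseline)]
        for i, g in enumerate(metric_grad(point, weights)):
            avg_grad[i] += g / steps
    return [(a - b) * g for a, b, g in zip(acts, baseline, avg_grad)]

def mean_ablate(acts, baseline, keep):
    """Keep neurons in `keep`; replace every other neuron with its baseline (mean) value."""
    return [a if i in keep else b for i, (a, b) in enumerate(zip(acts, baseline))]

def faithfulness(acts, baseline, weights, keep):
    """Normalized metric recovery when only `keep` is left un-ablated (assumed definition)."""
    full = metric(acts, weights)
    empty = metric(baseline, weights)
    circuit = metric(mean_ablate(acts, baseline, keep), weights)
    return (circuit - empty) / (full - empty)

# Hypothetical toy setup: a few neurons carry most of the readout weight.
n = 32
weights = [0.5 / (i + 1) for i in range(n)]
baseline = [0.1] * n                 # stands in for dataset-mean activations
acts = [b + 1.0 for b in baseline]   # activations on the prompt of interest

attr = integrated_gradients(acts, baseline, weights)
ranked = sorted(range(n), key=lambda i: abs(attr[i]), reverse=True)

for k in (0, 4, 16, n):
    circuit = set(ranked[:k])                # greedy top-k selection
    complement = set(range(n)) - circuit
    f = faithfulness(acts, baseline, weights, circuit)
    c = faithfulness(acts, baseline, weights, complement)
    print(f"k={k:2d}  faithfulness={f:.3f}  completeness(complement)={c:.3f}")
```

Sweeping k in this way traces the kind of faithfulness-versus-size curve the review refers to; in a real model the metric and its gradients would come from forward and backward passes rather than a closed-form function.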

Figure 1: Faithfulness and completeness for different choices of representation in the model (residual stream, attention, MLP activations, or MLP outputs) and basis (neurons or SAE) when applying Integrated Gradients, averaged over the 4 SVA tasks with paired data.

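Experimental Results

Research Questions

RQ1: Can MLP activation neurons provide circuits that are sparser yet still faithful compared with SAE features?
RQ2: Does RelP improve faithfulness and completeness over IG in neuron-based circuit tracing?
RQ3: Do neuron-based circuits generalize to unpaired data and reproduce findings from work based on cross-layer transcoders (CLTs)?
RQ4: What is the nature of edges in neuron-based circuits, and can RelP identify more faithful edges than IG?
RQ5: Can circuit tracing reveal interpretable multi-hop reasoning in LLMs, and show the effect of steering model outputs through specific neuron clusters?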

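Key Results

MLP activations yield far sparser circuits than MLP outputs (roughly 100x smaller) while remaining faithful to model behavior.
RelP narrows the gap between MLP-activation circuits and SAE circuits, reaching near-perfect faithfulness with about 200 neurons on the SVA tasks.
RelP outperforms IG in both paired and unpaired data settings, improving faithfulness and sometimes completeness.
Edge attribution with RelP (including stop-gradients) sharply reduces the edge set while keeping faithfulness high (>80%), striking the best balance at about 10% of candidate edges.
Neuron-level circuits in Llama 3.1 8B Instruct reproduce results obtained with cross-layer transcoders and make it possible to steer model outputs by targeting specific neuron clusters.
A case study on the multi-hop Texas state capital reasoning task reveals interpretable neuron clusters that correspond to earlier CLT results and enables targeted steering of the output.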

Figure 2: Faithfulness and completeness for Integrated Gradients vs. RelP, for different choices of representation in the model and basis (neurons or SAE), averaged over the 4 SVA tasks with paired data.

This review was generated by AI and reviewed by a human editor.