QUICK REVIEW

[Paper Review] Transformers converge to invariant algorithmic cores

Joshua S. Schiffman|arXiv (Cornell University)|2026. 02. 26.

Natural Language Processing Techniques|Citations: 0

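One-line summary

The paper extracts the low-dimensional algorithmic cores that transformers need for task performance, shows that these cores converge across independent training runs, and identifies a one-dimensional subject-verb agreement core across GPT-2 scales, revealing invariant computational structure.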

ABSTRACT

Large language models exhibit sophisticated capabilities, yet understanding how they work internally remains a central challenge. A fundamental obstacle is that training selects for behavior, not circuitry, so many weight configurations can implement the same function. Which internal structures reflect the computation, and which are accidents of a particular training run? This work extracts algorithmic cores: compact subspaces necessary and sufficient for task performance. Independently trained transformers learn different weights but converge to the same cores. Markov-chain transformers embed 3D cores in nearly orthogonal subspaces yet recover identical transition spectra. Modular-addition transformers discover compact cyclic operators at grokking that later inflate, yielding a predictive model of the memorization-to-generalization transition. GPT-2 language models govern subject-verb agreement through a single axis that, when flipped, inverts grammatical number throughout generation across scales. These results reveal low-dimensional invariants that persist across training runs and scales, suggesting that transformer computations are organized around compact, shared algorithmic structures. Mechanistic interpretability could benefit from targeting such invariants -- the computational essence -- rather than implementation-specific details.

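Motivation and objectives

Addresses the problem that training selects for behavior rather than internal circuitry, so many weight configurations can implement the same function.
Develops a method for extracting algorithmic cores: low-dimensional subspaces that are necessary and sufficient for task performance.
Shows that independently trained transformers converge to similar cores even though their internal weights differ.
Applies core extraction to progressively more complex settings: Markov chains, modular addition, and GPT-2 language models.
Shows that a universal one-dimensional agreement core governs subject-verb number agreement across GPT-2 scales.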

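Proposed method

Defines an algorithmic core as a low-dimensional subspace that ablation experiments show to be necessary and sufficient for task performance.
Uses ACE (Algorithmic Core Extraction) to extract cores from the model's hidden states and to test their sufficiency and necessity.
Fits a linear operator in core coordinates to recover the task dynamics, then compares its spectrum with that of the ground-truth dynamics.
Compares cores across independently trained models using geometric and statistical alignment measures (projector overlap, principal angles, CCA).
For modular addition, analyzes core formation during grokking and tracks how core inflation arises under weight decay.
Applies core extraction to GPT-2 Small/Medium/Large to identify a one-dimensional agreement core and tests causal interventions on it (necessity, sufficiency, flipping).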
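The three causal interventions named above (necessity, sufficiency, flipping) can be illustrated on a toy linear model. This is a minimal sketch, not the paper's ACE procedure: the axis `u`, the readout `w_readout`, the hidden states `H`, and all sizes are hypothetical stand-ins for a real transformer's hidden states and unembedding.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 200  # hypothetical hidden size and number of examples

# Hypothetical unit "agreement axis"; in the paper this would be extracted, not sampled.
u = rng.normal(size=d)
u /= np.linalg.norm(u)

number = rng.choice([-1.0, 1.0], size=n)      # -1 = singular, +1 = plural
noise = rng.normal(size=(n, d))
noise -= np.outer(noise @ u, u)               # noise lives off the axis
H = 2.0 * np.outer(number, u) + noise         # toy hidden states
w_readout = u + 0.1 * rng.normal(size=d)      # toy readout, mostly aligned with u

def project_out(H, u):
    """Necessity test: remove the component along u."""
    return H - np.outer(H @ u, u)

def keep_only(H, u):
    """Sufficiency test: keep only the component along u."""
    return np.outer(H @ u, u)

def flip(H, u):
    """Flip test: reflect hidden states across the hyperplane orthogonal to u."""
    return H - 2.0 * np.outer(H @ u, u)

def accuracy(H):
    """Fraction of examples whose readout sign matches the true number."""
    return float(np.mean(np.sign(H @ w_readout) == number))

acc_base = accuracy(H)
acc_ablated = accuracy(project_out(H, u))   # should fall to roughly chance
acc_core_only = accuracy(keep_only(H, u))   # should stay high
acc_flipped = accuracy(flip(H, u))          # should invert the prediction
```

In this toy setting, removing the axis drops the readout to about chance, keeping only the axis preserves it, and reflecting along the axis inverts it, which is the pattern the interventions are designed to detect.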

Figure 1 : Transformers trained on the same Markov task converge to a low-dimensional, causal algorithmic core. Three one-layer transformer language models with identical architectures ( $d_{\rm model}=64$ , $d_{\rm ff}=256$ , $|V|=4$ ) were initialized with independent random seeds and trained on t

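Experimental results

Research questions

RQ1: Do transformers contain low-dimensional algorithmic cores that are necessary and sufficient for task performance?
RQ2: Are these cores shared across independently trained models with different weights?
RQ3: Can the internal dynamics within a core be characterized mechanistically (for example, as Markovian or rotational operators)?
RQ4: Does a universal core for a linguistic computation such as subject-verb agreement exist across GPT-2 scales?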

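Key results

One-layer transformers trained independently on the same Markov task converge to a 3D core that is necessary for performance.
Cores from independent models are not geometrically aligned but are statistically aligned, with near-unity canonical correlations across all core dimensions.
Fitting linear dynamics inside the core recovers the ground-truth Markov spectrum: the eigenvalues match those of the Markov transition matrix, excluding the Perron–Frobenius eigenvalue.
For modular addition, the core forms at grokking and exposes a rotational basis; as weight decay continues, the core inflates because of distributed, redundant modes.
In GPT-2 (Small, Medium, Large), a single 1D agreement core governs subject-verb agreement; perturbing or flipping it reliably modulates or inverts grammatical number in open-ended generation.
Core coordinates are well aligned across GPT-2 scales, suggesting a universal, shared encoding of grammatical number across models.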
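The Markov-chain findings (geometrically misaligned subspaces, near-unity canonical correlations, and a recovered transition spectrum without the Perron–Frobenius eigenvalue) can be reproduced in a toy setting. The sketch below rests on several assumptions: the transition matrix `P` is made up, the two "models" are random linear embeddings of a shared 3D latent rather than trained transformers, and PCA stands in for the paper's core extraction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical symmetric 4-state transition matrix (real spectrum, uniform stationary law).
P = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.6, 0.2, 0.1],
              [0.1, 0.2, 0.5, 0.2],
              [0.1, 0.1, 0.2, 0.6]])

def sample_chain(P, n_steps, rng):
    """Sample a state sequence by inverting each row's CDF."""
    cdf = np.cumsum(P, axis=1)
    u = rng.random(n_steps)
    states = np.zeros(n_steps, dtype=int)
    for t in range(1, n_steps):
        nxt = np.searchsorted(cdf[states[t - 1]], u[t])
        states[t] = min(nxt, P.shape[0] - 1)  # guard against float round-off
    return states

states = sample_chain(P, 50_000, rng)
centered = np.eye(4)[states] - 0.25             # one-hot minus stationary distribution
B = np.linalg.svd(np.eye(4) - 0.25)[0][:, :3]   # orthonormal basis of the sum-zero subspace
Z = centered @ B                                # shared 3D latent

def embed(Z, d_model, rng):
    """Place the latent in a random 3D subspace of R^d_model, in a random internal basis."""
    Q, _ = np.linalg.qr(rng.normal(size=(d_model, 3)))  # orthonormal columns
    R = rng.normal(size=(3, 3)) + 2.0 * np.eye(3)       # model-specific basis change
    return Z @ R @ Q.T

def pca_core(H, k=3):
    """Top-k principal directions of H (a stand-in for core extraction)."""
    _, _, Vt = np.linalg.svd(H - H.mean(axis=0), full_matrices=False)
    return Vt[:k].T

def canonical_correlations(X, Y):
    """Canonical correlations via QR of the centered data matrices."""
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)

def fitted_spectrum(C):
    """Least-squares fit of C[t+1] ~ C[t] @ A, returning sorted real parts of eig(A)."""
    A, *_ = np.linalg.lstsq(C[:-1], C[1:], rcond=None)
    return np.sort(np.linalg.eigvals(A).real)

H1, H2 = embed(Z, 64, rng), embed(Z, 64, rng)   # two toy "models"
U1, U2 = pca_core(H1), pca_core(H2)
C1, C2 = H1 @ U1, H2 @ U2                       # core coordinates

cos_angles = np.linalg.svd(U1.T @ U2, compute_uv=False)  # cosines of principal angles
canon_corr = canonical_correlations(C1, C2)
true_spec = np.linalg.eigvalsh(P)[:-1]          # drop the Perron-Frobenius eigenvalue 1
spec1, spec2 = fitted_spectrum(C1), fitted_spectrum(C2)
```

Comparing `cos_angles` with `canon_corr` separates the two notions of alignment: the subspaces can point in unrelated directions while the coordinates inside them carry the same information. The eigenvalue 1 is absent from the fitted spectra because the centered state representation has no component along the stationary distribution.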

Figure 2 : Modular addition cores form at grokking and are defined by automatically recoverable rotational operations. Three two-layer transformers with equivalent architectures ( $d_{\rm model}=128$ , $d_{\rm ff}=512$ ) were initialized with independent random seeds and trained for $2\times 10^{3}$

This review was generated by AI and reviewed by a human editor.