QUICK REVIEW

[Paper Review] Transformers are Bayesian Networks

Gregory Coppola | arXiv (Cornell University) | 2026. 03. 17.

Bayesian Modeling and Causal Inference | Citations: 0

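One-Line Summary

The paper proves that a sigmoid transformer is a Bayesian network: its forward pass implements belief propagation (BP) on an implicit factor graph, and with explicitly constructed weights and grounding it can perform exact BP.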

ABSTRACT

Transformers are the dominant architecture in AI, yet why they work remains poorly understood. This paper offers a precise answer: a transformer is a Bayesian network. We establish this in five ways. First, we prove that every sigmoid transformer with any weights implements weighted loopy belief propagation on its implicit factor graph. One layer is one round of BP. This holds for any weights -- trained, random, or constructed. Formally verified against standard mathematical axioms. Second, we give a constructive proof that a transformer can implement exact belief propagation on any declared knowledge base. On knowledge bases without circular dependencies this yields provably correct probability estimates at every node. Formally verified against standard mathematical axioms. Third, we prove uniqueness: a sigmoid transformer that produces exact posteriors necessarily has BP weights. There is no other path through the sigmoid architecture to exact posteriors. Formally verified against standard mathematical axioms. Fourth, we delineate the AND/OR boolean structure of the transformer layer: attention is AND, the FFN is OR, and their strict alternation is Pearl's gather/update algorithm exactly. Fifth, we confirm all formal results experimentally, corroborating the Bayesian network characterization in practice. We also establish the practical viability of loopy belief propagation despite the current lack of a theoretical convergence guarantee. We further prove that verifiable inference requires a finite concept space. Any finite verification procedure can distinguish at most finitely many concepts. Without grounding, correctness is not defined. Hallucination is not a bug that scaling can fix. It is the structural consequence of operating without concepts. Formally verified against standard mathematical axioms.

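Motivation and Objectives

Show that a sigmoid transformer performs belief propagation on the implicit factor graph defined by its own weights.
Show that exact belief propagation is achievable on any factor graph using explicitly constructed weights.
Establish the uniqueness of the BP weights that produce exact posteriors in a sigmoid transformer.
Characterize the AND/OR Boolean structure of the transformer layer and explain how it relates to Pearl's gather/update algorithm.
Argue that verifiable inference requires a finite, grounded concept space, and connect this requirement to hallucination.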

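Proposed Method

Formally interpret a sigmoid transformer as performing one round of weighted belief propagation per forward pass on an implicit factor graph.
Construct explicit BP weights so that a transformer implements one round of exact BP on any declared factor graph; given d·⌈log2 k⌉ layers, the depth scales to full BP for k-ary factors.
Show that if a sigmoid transformer produces exact posteriors, its weights must coincide uniquely with the BP weights (FFN w0=w1=1, b=0; attention as projectDim/crossProject).
Attention implements an AND-like gather of inputs and the FFN implements an OR-like update, so stacking layers yields Pearl's gather/update.
Provide empirical validation of BP convergence and accuracy across a range of graph structures, argue that verifiable inference requires a finite concept space, and relate hallucination to grounding.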
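
The claim that BP is exact on graphs without cycles can be checked on a toy example. The sketch below is plain sum-product message passing, not the paper's transformer construction: it runs BP on a hypothetical three-variable chain A - B - C and compares the resulting marginals with brute-force enumeration. The variable names, priors, and compatibility tables are invented for illustration.

```python
from itertools import product

# Hypothetical toy model: three binary variables on a chain A - B - C.
unary = {
    "A": [0.6, 0.4],
    "B": [0.5, 0.5],
    "C": [0.3, 0.7],
}
# pairwise[(u, v)][xu][xv] is the compatibility of u=xu with v=xv.
pairwise = {
    ("A", "B"): [[0.9, 0.1], [0.2, 0.8]],
    ("B", "C"): [[0.7, 0.3], [0.4, 0.6]],
}


def neighbors(v):
    """Variables that share a pairwise factor with v."""
    return [b if a == v else a for (a, b) in pairwise if v in (a, b)]


def compat(u, xu, v, xv):
    """Look up the pairwise compatibility regardless of key order."""
    if (u, v) in pairwise:
        return pairwise[(u, v)][xu][xv]
    return pairwise[(v, u)][xv][xu]


def message(src, dst):
    """Sum-product message from src to dst (recursive, so trees only)."""
    out = []
    for x_dst in (0, 1):
        total = 0.0
        for x_src in (0, 1):
            term = unary[src][x_src] * compat(src, x_src, dst, x_dst)
            for n in neighbors(src):
                if n != dst:
                    term *= message(n, src)[x_src]
            total += term
        out.append(total)
    return out


def bp_marginal(v):
    """Belief at v: local prior times all incoming messages, normalized."""
    belief = list(unary[v])
    for n in neighbors(v):
        m = message(n, v)
        belief = [belief[x] * m[x] for x in (0, 1)]
    z = sum(belief)
    return [b / z for b in belief]


def brute_force_marginal(v):
    """Exact marginal of v by summing over every joint assignment."""
    names = list(unary)
    totals = [0.0, 0.0]
    for assignment in product((0, 1), repeat=len(names)):
        x = dict(zip(names, assignment))
        w = 1.0
        for n in names:
            w *= unary[n][x[n]]
        for (a, b), table in pairwise.items():
            w *= table[x[a]][x[b]]
        totals[x[v]] += w
    z = sum(totals)
    return [t / z for t in totals]


for v in unary:
    print(v, bp_marginal(v), brute_force_marginal(v))
```

On a chain the two columns agree up to floating-point error. Adding a third factor between A and C would create a cycle; the recursive `message` function would then no longer terminate, which is why loopy BP is run as an iterative scheme instead.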

Experimental Results

Research Questions

RQ1: Can a sigmoid transformer be interpreted as performing Bayesian belief propagation on an implicit factor graph?
RQ2: Can we construct explicit BP-weighted transformers that perform exact BP on any factor graph?
RQ3: Are the BP weights unique for producing exact posteriors in sigmoid transformers?
RQ4: What is the Boolean structure (AND/OR) of transformer layers and how does it relate to Pearl’s algorithm?
RQ5: What role does grounding and a finite concept space play in verifiable inference and hallucination?

Key Results

Graph	BP exact	Transformer	Max error
0	[0.7349, 0.4366]	[0.7338, 0.4346]	0.0021
1	[0.4097, 0.4036]	[0.4096, 0.4031]	0.0005
4	[0.6459, 0.8298]	[0.6436, 0.8297]	0.0023
9	[0.4084, 0.5523]	[0.4084, 0.5526]	0.0003
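
The max-error column can be recomputed from the two posterior columns as a sanity check. The snippet below is a small sketch that takes the four rows exactly as printed above and computes the largest absolute difference per graph; `max_abs_error` is a helper name introduced here, not something from the paper.

```python
# Rows copied from the table: graph id -> (BP exact, transformer, reported max error).
rows = {
    0: ([0.7349, 0.4366], [0.7338, 0.4346], 0.0021),
    1: ([0.4097, 0.4036], [0.4096, 0.4031], 0.0005),
    4: ([0.6459, 0.8298], [0.6436, 0.8297], 0.0023),
    9: ([0.4084, 0.5523], [0.4084, 0.5526], 0.0003),
}


def max_abs_error(exact, approx):
    """Largest absolute difference between two posterior vectors."""
    return max(abs(e - a) for e, a in zip(exact, approx))


for graph_id, (exact, approx, reported) in rows.items():
    recomputed = max_abs_error(exact, approx)
    print(graph_id, round(recomputed, 4), reported)
```

Because the posteriors are shown to four decimals, the recomputed value can differ from the reported one in the last digit. For graph 0 the rounded entries give 0.0020 against a reported 0.0021, presumably because the reported figure was computed before rounding.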

A forward pass of any sigmoid transformer implements one round of weighted belief propagation on an implicit factor graph G(W).
A transformer with explicitly constructed weights can implement one round of exact BP on any factor graph, with depth scaling to full BP given d·⌈log2 k⌉ layers for k-ary factors.
If a sigmoid transformer produces exact posteriors on a grounded factor graph, its weights are uniquely the BP weights (FFN weights w0=w1=1, b=0; attention as projectDim/crossProject).
Attention implements the AND-like gather of inputs, and the FFN implements the OR-like update, yielding Pearl’s gather/update across layers.
Experiments report convergence to exact posteriors on both trees and loopy graphs. The paper also argues that a finite concept space is required for verifiable inference and ties hallucination to a lack of grounding.
Empirical results include Table 1 comparing BP exact vs. Transformer posteriors on held-out graphs, showing small max errors across graphs.
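
The uniqueness result quotes FFN weights w0=w1=1, b=0, but this review does not say what those weights act on. As one possible reading, and purely as an assumption, suppose the unit receives two probabilities in log-odds form. Then sigmoid(1·logit(p0) + 1·logit(p1) + 0) equals the normalized product p0·p1 / (p0·p1 + (1-p0)(1-p1)), the standard rule for combining two independent pieces of evidence about a binary variable under a uniform prior. The sketch below checks that identity numerically; `sigmoid_unit` and `normalized_product` are hypothetical names.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_unit(p0, p1, w0=1.0, w1=1.0, b=0.0):
    """Assumed form: a sigmoid unit applied to two log-odds inputs."""
    return sigmoid(w0 * logit(p0) + w1 * logit(p1) + b)


def normalized_product(p0, p1):
    """Combine two independent estimates of the same binary event."""
    num = p0 * p1
    return num / (num + (1.0 - p0) * (1.0 - p1))


for p0, p1 in [(0.9, 0.8), (0.3, 0.6), (0.5, 0.5)]:
    print(p0, p1, sigmoid_unit(p0, p1), normalized_product(p0, p1))

# With any other weight, e.g. w0=2, the two quantities no longer agree.
print(sigmoid_unit(0.9, 0.8, w0=2.0), normalized_product(0.9, 0.8))
```

Under this reading, the unit weights and zero bias are exactly what make the sigmoid output match the product rule, which gives some intuition for why other weights could not yield exact posteriors.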

This review was generated by AI and reviewed by a human editor.