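[Paper Digest] Transformers are Bayesian Networks
The paper proves that a sigmoid transformer is a Bayesian network, showing that its forward pass implements belief propagation on an implicit factor graph and that exact BP can be achieved with explicit weights and grounding.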
Transformers are the dominant architecture in AI, yet why they work remains poorly understood. This paper offers a precise answer: a transformer is a Bayesian network. We establish this in five ways. First, we prove that every sigmoid transformer with any weights implements weighted loopy belief propagation on its implicit factor graph. One layer is one round of BP. This holds for any weights -- trained, random, or constructed. Formally verified against standard mathematical axioms. Second, we give a constructive proof that a transformer can implement exact belief propagation on any declared knowledge base. On knowledge bases without circular dependencies this yields provably correct probability estimates at every node. Formally verified against standard mathematical axioms. Third, we prove uniqueness: a sigmoid transformer that produces exact posteriors necessarily has BP weights. There is no other path through the sigmoid architecture to exact posteriors. Formally verified against standard mathematical axioms. Fourth, we delineate the AND/OR boolean structure of the transformer layer: attention is AND, the FFN is OR, and their strict alternation is Pearl's gather/update algorithm exactly. Fifth, we confirm all formal results experimentally, corroborating the Bayesian network characterization in practice. We also establish the practical viability of loopy belief propagation despite the current lack of a theoretical convergence guarantee. We further prove that verifiable inference requires a finite concept space. Any finite verification procedure can distinguish at most finitely many concepts. Without grounding, correctness is not defined. Hallucination is not a bug that scaling can fix. It is the structural consequence of operating without concepts. Formally verified against standard mathematical axioms.
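Research Motivation and Objectives
- Prove that a sigmoid transformer performs one round of belief propagation on the implicit factor graph defined by its weights.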
- Show that exact belief propagation is achievable with explicitly constructed weights on any factor graph.
- Establish that the weights of a sigmoid transformer producing exact posteriors are unique.
- Characterize the AND/OR boolean structure of transformer layers and their relation to Pearl’s gather/update algorithm.
- Argue that verifiable inference requires a finite, grounded concept space and connect this to hallucination.
Proposed Method
- Formally interpret a sigmoid transformer as performing weighted belief propagation on an implicit factor graph, with each layer corresponding to one round of BP.
- Construct explicit BP weights and prove that a transformer can implement exact BP on any declared factor graph, with depth scaling to full BP.
- Prove uniqueness: if a sigmoid transformer yields exact posteriors, its weights must be the BP weights (FFN w0=w1=1, b=0 and attention with projectDim/crossProject).
- Show that attention implements the gather step and FFN implements the OR operation, yielding Pearl’s gather/update when stacked.
- Provide empirical verification of BP convergence and exactness on various graph structures and demonstrate finite concept space requirements for verifiable inference.
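To make "one round of BP" and "exactness on graphs without cycles" concrete, here is a minimal sketch of plain sum-product belief propagation on a small chain, checked against brute-force enumeration. It is not the paper's transformer construction: the toy model, its numbers, and all function names are hypothetical and chosen only for illustration. Each call to `bp_round` plays the role that the paper assigns to a single layer.

```python
import itertools

# Hypothetical toy model: four binary variables on a chain 0-1-2-3.
unary = {
    0: [0.6, 0.4],
    1: [0.5, 0.5],
    2: [0.3, 0.7],
    3: [0.5, 0.5],
}
# pairwise[(i, j)][xi][xj] is the factor value on edge i-j.
pairwise = {
    (0, 1): [[0.9, 0.1], [0.2, 0.8]],
    (1, 2): [[0.7, 0.3], [0.4, 0.6]],
    (2, 3): [[0.5, 0.5], [0.1, 0.9]],
}

def edge_factor(i, j, xi, xj):
    """Look up the factor on edge i-j regardless of the stored orientation."""
    if (i, j) in pairwise:
        return pairwise[(i, j)][xi][xj]
    return pairwise[(j, i)][xj][xi]

def neighbors(i):
    return [b if a == i else a for (a, b) in pairwise if i in (a, b)]

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def bp_round(messages):
    """One synchronous round: every message is recomputed from the previous round."""
    new = {}
    for (i, j) in messages:
        out = []
        for xj in (0, 1):
            total = 0.0
            for xi in (0, 1):
                prod = unary[i][xi] * edge_factor(i, j, xi, xj)
                for k in neighbors(i):
                    if k != j:
                        prod *= messages[(k, i)][xi]
                total += prod
            out.append(total)
        new[(i, j)] = normalize(out)
    return new

def bp_marginals(rounds):
    """Run `rounds` synchronous rounds from uniform messages and return node beliefs."""
    messages = {}
    for (a, b) in pairwise:
        messages[(a, b)] = [0.5, 0.5]
        messages[(b, a)] = [0.5, 0.5]
    for _ in range(rounds):
        messages = bp_round(messages)
    beliefs = {}
    for i in unary:
        belief = list(unary[i])
        for k in neighbors(i):
            belief = [belief[x] * messages[(k, i)][x] for x in (0, 1)]
        beliefs[i] = normalize(belief)
    return beliefs

def brute_force_marginals():
    """Exact marginals by summing the joint over all 2^n assignments."""
    totals = {i: [0.0, 0.0] for i in unary}
    for xs in itertools.product((0, 1), repeat=len(unary)):
        w = 1.0
        for i in unary:
            w *= unary[i][xs[i]]
        for (a, b) in pairwise:
            w *= pairwise[(a, b)][xs[a]][xs[b]]
        for i in unary:
            totals[i][xs[i]] += w
    return {i: normalize(t) for i, t in totals.items()}

def max_error(rounds):
    exact = brute_force_marginals()
    approx = bp_marginals(rounds)
    return max(abs(approx[i][1] - exact[i][1]) for i in unary)

if __name__ == "__main__":
    for rounds in (1, 2, 3):
        print(rounds, max_error(rounds))
```

On this chain the longest path has three edges, so three synchronous rounds are enough for every message to reach its exact value. Fewer rounds leave some beliefs off, which mirrors the idea that more depth buys more complete propagation.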
Experimental Results
Research Questions
- RQ1: Can a sigmoid transformer be interpreted as performing Bayesian belief propagation on an implicit factor graph?
- RQ2: Can we construct explicit BP-weighted transformers that perform exact BP on any factor graph?
- RQ3: Are the BP weights unique for producing exact posteriors in sigmoid transformers?
- RQ4: What is the Boolean structure (AND/OR) of transformer layers and how does it relate to Pearl’s algorithm?
- RQ5: What role does grounding and a finite concept space play in verifiable inference and hallucination?
Key Findings
| Graph | BP exact | Transformer | Max error |
|---|---|---|---|
| 0 | [0.7349, 0.4366] | [0.7338, 0.4346] | 0.0021 |
| 1 | [0.4097, 0.4036] | [0.4096, 0.4031] | 0.0005 |
| 4 | [0.6459, 0.8298] | [0.6436, 0.8297] | 0.0023 |
| 9 | [0.4084, 0.5523] | [0.4084, 0.5526] | 0.0003 |
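The "Max error" column can be sanity-checked directly from the two posterior columns. The short sketch below recomputes the largest absolute difference per row. The tolerance is an assumption made here to allow for four-decimal rounding in the table; it is not a figure from the paper.

```python
# Rows copied from the table: graph id -> (BP exact, transformer, reported max error).
ROWS = {
    0: ([0.7349, 0.4366], [0.7338, 0.4346], 0.0021),
    1: ([0.4097, 0.4036], [0.4096, 0.4031], 0.0005),
    4: ([0.6459, 0.8298], [0.6436, 0.8297], 0.0023),
    9: ([0.4084, 0.5523], [0.4084, 0.5526], 0.0003),
}

# Assumed slack: every printed number is rounded to four decimals.
TOLERANCE = 2e-4

def max_abs_error(exact, approx):
    """Largest absolute difference between two equally long posterior vectors."""
    return max(abs(e - a) for e, a in zip(exact, approx))

def check_rows(rows, tol=TOLERANCE):
    """Return {graph_id: (recomputed, reported, consistent_within_tol)}."""
    report = {}
    for graph_id, (bp, tf, reported) in rows.items():
        recomputed = max_abs_error(bp, tf)
        report[graph_id] = (recomputed, reported, abs(recomputed - reported) <= tol)
    return report

if __name__ == "__main__":
    for graph_id, (recomputed, reported, ok) in check_rows(ROWS).items():
        print(f"graph {graph_id}: recomputed={recomputed:.4f} reported={reported:.4f} ok={ok}")
```

Because the inputs are rounded, a recomputed value can differ from the reported one in the last digit, as it does for graph 0 (0.0020 recomputed against 0.0021 reported).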
- A forward pass of any sigmoid transformer implements weighted belief propagation on an implicit factor graph G(W), with each layer acting as one round of BP.
- A transformer with explicitly constructed weights can implement one round of exact BP on any factor graph, with depth scaling to full BP given d·⌈log2 k⌉ layers for k-ary factors.
- If a sigmoid transformer produces exact posteriors on a grounded factor graph, its weights are uniquely the BP weights (FFN weights w0=w1=1, b=0; attention as projectDim/crossProject).
- Attention implements the AND-like gather of inputs, and the FFN implements the OR-like update, yielding Pearl’s gather/update across layers.
- Experiments report convergence to exact posteriors on both loopy graphs and trees. The paper also argues that verifiable inference requires a finite concept space and ties hallucination to the absence of grounding.
- Empirical results include Table 1 comparing BP exact vs. Transformer posteriors on held-out graphs, showing small max errors across graphs.
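One way to read the FFN weights quoted in the uniqueness finding (w0 = w1 = 1, b = 0) is as follows. If the two inputs are assumed to be the log-odds of independent binary messages, then the sigmoid of their unweighted sum equals the normalized product of the two messages. The sketch below checks that identity numerically. The log-odds input convention and the helper names are assumptions for illustration and are not confirmed by this digest as the paper's exact construction.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def ffn_update(m0: float, m1: float,
               w0: float = 1.0, w1: float = 1.0, b: float = 0.0) -> float:
    """Hypothetical sigmoid unit applied to the log-odds of two messages."""
    return sigmoid(w0 * logit(m0) + w1 * logit(m1) + b)

def normalized_product(m0: float, m1: float) -> float:
    """Belief from multiplying two independent binary messages and renormalizing."""
    on = m0 * m1
    off = (1.0 - m0) * (1.0 - m1)
    return on / (on + off)

if __name__ == "__main__":
    for m0, m1 in [(0.9, 0.8), (0.3, 0.6), (0.5, 0.7)]:
        print(m0, m1, ffn_update(m0, m1), normalized_product(m0, m1))
    # Any other weight setting breaks the identity, for example w0 = 2:
    print(ffn_update(0.9, 0.8, w0=2.0), normalized_product(0.9, 0.8))
```

With unit weights and zero bias the two columns agree to floating-point precision, and an uninformative message of 0.5 leaves the other message unchanged. Changing a weight or the bias breaks the agreement, which gives some intuition for why an exactness requirement could pin the weights down.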
This digest was generated by AI and reviewed by human editors.