QUICK REVIEW

[Paper Review] Transformer Is Inherently a Causal Learner

Xinyue Wang, Stephen Wang|arXiv (Cornell University)|Jan 9, 2026

Time Series Analysis and Forecasting | Cited by 0

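One-Sentence Summary

A decoder-only Transformer trained for autoregressive forecasting can recover the true lagged causal graph from data via aggregated gradient attributions, without any explicit causal objective.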

ABSTRACT

We reveal that transformers trained in an autoregressive manner naturally encode time-delayed causal structures in their learned representations. When predicting future values in multivariate time series, the gradient sensitivities of transformer outputs with respect to past inputs directly recover the underlying causal graph, without any explicit causal objectives or structural constraints. We prove this connection theoretically under standard identifiability conditions and develop a practical extraction method using aggregated gradient attributions. On challenging cases such as nonlinear dynamics, long-term dependencies, and non-stationary systems, this approach greatly surpasses the performance of state-of-the-art discovery algorithms, especially as data heterogeneity increases, exhibiting scaling potential where causal accuracy improves with data volume and heterogeneity, a property traditional methods lack. This unifying view lays the groundwork for a future paradigm where causal discovery operates through the lens of foundation models, and foundation models gain interpretability and enhancement through the lens of causality.

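Motivation and Objectives

Motivate causal discovery in high-dimensional, nonlinear, and non-stationary time-series settings.
Prove that a decoder-only Transformer trained for forecasting identifies the lagged causal structure under standard identifiability assumptions.
Propose a practical method, based on a gradient-energy readout, that extracts the causal graph using Layer-wise Relevance Propagation (LRP).
Demonstrate the method's scalability and robustness under nonlinear, long-range, and heterogeneous dynamics.
Discuss how this provides a foundation-model-driven paradigm for scalable causal discovery and interpretability.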

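Proposed Method

Model a time series of p variables with a lag window L and direct causal parents Pa(i,t).
Train a decoder-only Transformer with autoregressive masking to predict X_t from X_{t-1},…,X_{t-L}.
Compute the gradient-energy attribution H_{j,i}^{\ell}, or its Gaussian specialization G_{j,i}^{\ell}, to identify the edge j -> i at lag \ell.
Approximate G with an aggregated Layer-wise Relevance Propagation (LRP) readout R_{ij}^{(\ell)}, then binarize it into a sparse graph.
Binarize edges with a per-target Top-k rule or a uniform threshold to obtain the causal graph.
Use gradients rather than raw attention for the readout, because of token mixing in deep Transformers.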
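
The readout and binarization steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `predictor` is a hypothetical hand-written function standing in for a trained Transformer, the gradient energy is assumed to be the mean squared partial derivative of output i with respect to input j at a given lag, and central finite differences are swapped in for LRP. The names `gradient_energy`, `top_k_edges`, and `threshold_edges` are illustrative.

```python
import math
import random


def predictor(window):
    """Hypothetical stand-in for a trained forecaster (NOT the paper's model).

    window[lag - 1][j] holds variable j at time t - lag.
    Toy ground truth: x0 <- x0 (lag 1); x1 <- x0 (lag 2);
    x2 <- x1 (lag 1) and x2 (lag 1).
    """
    lag1, lag2 = window[0], window[1]
    return [
        0.8 * lag1[0],
        math.tanh(1.5 * lag2[0]),
        0.5 * lag1[1] + 0.3 * math.sin(lag1[2]),
    ]


def gradient_energy(f, windows, n_vars, n_lags, eps=1e-4):
    """energy[lag - 1][j][i] = mean over windows of (d f_i / d x_{j, t-lag})^2.

    Assumed definition; derivatives use central finite differences.
    """
    energy = [[[0.0] * n_vars for _ in range(n_vars)] for _ in range(n_lags)]
    for w in windows:
        for l in range(n_lags):
            for j in range(n_vars):
                up = [row[:] for row in w]
                down = [row[:] for row in w]
                up[l][j] += eps
                down[l][j] -= eps
                f_up, f_down = f(up), f(down)
                for i in range(n_vars):
                    g = (f_up[i] - f_down[i]) / (2 * eps)
                    energy[l][j][i] += g * g / len(windows)
    return energy


def top_k_edges(energy, k):
    """Per-target Top-k: keep the k highest-energy (source, lag) pairs per target."""
    n_lags, n_vars = len(energy), len(energy[0])
    edges = set()
    for i in range(n_vars):
        scored = [(energy[l][j][i], j, l + 1)
                  for l in range(n_lags) for j in range(n_vars)]
        scored.sort(reverse=True)
        edges.update((j, i, lag) for _, j, lag in scored[:k])
    return edges


def threshold_edges(energy, tau):
    """Uniform threshold: keep every (source, target, lag) with energy above tau."""
    return {(j, i, l + 1)
            for l, per_lag in enumerate(energy)
            for j, row in enumerate(per_lag)
            for i, value in enumerate(row) if value > tau}


random.seed(0)
# 200 random input windows, each with 2 lags of 3 variables.
windows = [[[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(2)]
           for _ in range(200)]
energy = gradient_energy(predictor, windows, n_vars=3, n_lags=2)
print(sorted(threshold_edges(energy, tau=0.01)))
print(sorted(top_k_edges(energy, k=1)))
```

Edges are written as (source, target, lag) triples. In this toy system the per-target Top-k rule gives every target the same number of parents, so k = 1 drops the weaker self-edge of variable 2, whereas a uniform threshold keeps all four true edges. Which rule suits a real dataset depends on how much the in-degree varies across targets.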

Figure 1: Data generation and transformer-based causal discovery. Left: A decoder-only transformer trained for next-step prediction. Tokens are lagged observations from $t\!-\!L$ to $t\!-\!1$ ; the model predicts $X_{t}$ from $X_{t-1:t-L}$ . Right: A lagged data-generating process with $N\!=\!3$ and
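
The token layout described in the Figure 1 caption, where lagged observations from t-L to t-1 are used to predict X_t, can be sketched as a simple windowing step. This is an assumed data-preparation routine for illustration only; `make_windows` and the toy `series` are hypothetical and not taken from the paper.

```python
def make_windows(series, n_lags):
    """Build (window, target) pairs for next-step prediction.

    series: list of time steps, each a list of p variable values.
    Returns windows and targets, where windows[n][lag - 1] is the
    observation at t - lag and targets[n] is the observation at t.
    """
    windows, targets = [], []
    for t in range(n_lags, len(series)):
        windows.append([series[t - lag] for lag in range(1, n_lags + 1)])
        targets.append(series[t])
    return windows, targets


# Toy 2-variable series with 5 time steps (illustrative values only).
series = [[float(t), float(10 * t)] for t in range(5)]
windows, targets = make_windows(series, n_lags=2)
print(len(windows), windows[0], targets[0])
```

Indexing each window by lag keeps the mapping from an input position back to a (variable, lag) pair explicit, which is what a lag-resolved attribution readout needs.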

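Experimental Results

Research Questions

RQ1: Under standard identifiability assumptions (A1–A4), can a decoder-only Transformer trained for forecasting identify the true lagged causal graph in time-series data?
RQ2: Does gradient-based attribution (via LRP) recover causal structure more reliably than attention or traditional methods under nonlinear, long-range, and non-stationary dynamics?
RQ3: How do data volume, heterogeneity, and model depth affect causal discovery performance in this framework?
RQ4: Can the learned lagged causal structure be enhanced or refined when latent confounders or instantaneous effects are present?
RQ5: Which practical strategies (domain indicators, post-processing pipelines) improve the data efficiency and robustness of Transformers for causal discovery?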

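Key Findings

Gradient-based decoding achieves unique identifiability of the lagged causal parents under A1–A4 and regularity conditions.
The gradient-energy readout via LRP effectively recovers the true causal graph, outperforming baselines in nonlinear, high-dimensional, and long-range settings.
Transformer-based discovery improves as data volume grows and scales with data heterogeneity, surpassing existing methods in challenging scenarios.
Contextualized attention lets a single model capture multiple dynamic dependency patterns, adapting to non-stationarity without a fixed static mask.
Post-processing steps that are latent-aware or domain-augmented can mitigate issues from latent confounders or instantaneous effects, improving robustness.
Deeper Transformers with gradient-based readouts yield more accurate causal structures than shallow variants or attention-based explanations.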
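
The comparisons above are reported as F1 scores. A minimal sketch of how an F1 score between a recovered and a true lagged graph might be computed follows, assuming each edge is a (source, target, lag) triple and that an edge counts as correct only if all three match. The paper's exact evaluation protocol is not given in this review, so `edge_f1` and the example edge sets are hypothetical.

```python
def edge_f1(predicted, truth):
    """F1 between two sets of (source, target, lag) edges."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Illustrative edge sets: three correct edges, one spurious, one missed.
truth = {(0, 0, 1), (0, 1, 2), (1, 2, 1), (2, 2, 1)}
predicted = {(0, 0, 1), (0, 1, 2), (1, 2, 1), (1, 0, 1)}
score = edge_f1(predicted, truth)
print(score)
```

Requiring the lag to match makes the score stricter than one computed on the lag-collapsed summary graph; either convention is plausible, so the choice should be stated alongside any reported number.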

Figure 2: F1 score analysis across regimes. (A) Mean F1 across all experiments (averages exclude timeout cases). (B) High-dimensional input: F1 averaged across scales and seeds vs. the number of nodes. (C) Long-range dependencies: F1 averaged across scales and seeds vs. maximum lag. (D) Nonlinearity


This review was generated by AI and reviewed by a human editor.