QUICK REVIEW

[Paper Review] Transformer Is Inherently a Causal Learner

Xinyue Wang, Stephen Wang|arXiv (Cornell University)|Jan 9, 2026

Time Series Analysis and Forecasting | Citations: 0

One-Sentence Summary

Decoder-only transformers trained for autoregressive forecasting can recover the true lagged causal graph from data via aggregated gradient attributions, without explicit causal objectives.

ABSTRACT

We reveal that transformers trained in an autoregressive manner naturally encode time-delayed causal structures in their learned representations. When predicting future values in multivariate time series, the gradient sensitivities of transformer outputs with respect to past inputs directly recover the underlying causal graph, without any explicit causal objectives or structural constraints. We prove this connection theoretically under standard identifiability conditions and develop a practical extraction method using aggregated gradient attributions. On challenging cases such as nonlinear dynamics, long-term dependencies, and non-stationary systems, this approach greatly surpasses the performance of state-of-the-art discovery algorithms, especially as data heterogeneity increases, exhibiting scaling potential where causal accuracy improves with data volume and heterogeneity, a property traditional methods lack. This unifying view lays the groundwork for a future paradigm where causal discovery operates through the lens of foundation models, and foundation models gain interpretability and enhancement through the lens of causality.

Motivation and Objectives

Motivate causal discovery in high-dimensional, nonlinear, and non-stationary time-series settings.
Show that decoder-only transformers trained for forecasting identify lagged causal structure under standard identifiability assumptions.
Propose a practical gradient-energy readout using Layer-wise Relevance Propagation (LRP) to extract causal graphs.
Demonstrate scalability and robustness of the approach across nonlinear, long-range, and heterogeneous dynamics.
Discuss how this enables a foundation-model–driven paradigm for scalable causal discovery and interpretability.

Proposed Method

Model a p-variate time series with lag window L and direct causal parents Pa(i,t).
Train a decoder-only transformer to predict X_t from X_{t-1},…,X_{t-L} with autoregressive masking.
Compute gradient-energy-based attributions $H_{j,i}^{\ell}$, or their Gaussian specialization $G_{j,i}^{\ell}$, to identify edges $j \to i$ at lag $\ell$.
Approximate $G$ with aggregated Layer-wise Relevance Propagation (LRP) relevance scores $R_{ij}^{(\ell)}$.
Binarize edges using Top-k per target or a uniform-threshold rule to obtain a causal graph.
Justify using gradients rather than raw attention due to token mixing in deep transformers.
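
The readout and binarization steps above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_predictor` is a hypothetical stand-in for a trained forecaster with known parents, central finite differences are swapped in for LRP or autograd, and the function names, edge encoding `(source, target, lag)`, and `k` are all assumptions made for illustration.

```python
import random

def gradient_energy(predict, samples, p, L, eps=1e-4):
    """Estimate the mean squared sensitivity of output i to input j at each lag.

    predict: maps a window (window[l-1][j] = X_{t-l, j}) to a list of p outputs.
    Returns energy[lag-1][j][i], a score for the candidate edge j -> i at that lag.
    """
    energy = [[[0.0] * p for _ in range(p)] for _ in range(L)]
    for window in samples:
        for lag in range(L):
            for j in range(p):
                # Central finite difference, used here in place of LRP/autograd.
                up = [row[:] for row in window]
                down = [row[:] for row in window]
                up[lag][j] += eps
                down[lag][j] -= eps
                f_up, f_down = predict(up), predict(down)
                for i in range(p):
                    grad = (f_up[i] - f_down[i]) / (2 * eps)
                    energy[lag][j][i] += grad * grad / len(samples)
    return energy

def top_k_edges(energy, k):
    """Top-k per target: keep the k highest-energy (source, target, lag) candidates."""
    L, p = len(energy), len(energy[0])
    edges = set()
    for i in range(p):
        scored = [(energy[lag][j][i], j, lag + 1)
                  for lag in range(L) for j in range(p)]
        scored.sort(reverse=True)
        edges.update((j, i, lag) for _, j, lag in scored[:k])
    return edges

def toy_predictor(window):
    """Hypothetical forecaster: 3 variables, 2 lags, two true parents per target."""
    x1, x2 = window[0], window[1]  # rows for lag 1 and lag 2
    return [
        0.8 * x1[0] + 0.4 * x2[2],       # X0 <- X0 (lag 1), X2 (lag 2)
        0.5 * x1[1] + 0.9 * x2[0] ** 2,  # X1 <- X1 (lag 1), X0 (lag 2, nonlinear)
        0.7 * x1[2] - 0.6 * x1[1],       # X2 <- X2 (lag 1), X1 (lag 1)
    ]

random.seed(0)
windows = [[[random.gauss(0, 1) for _ in range(3)] for _ in range(2)]
           for _ in range(200)]
energy = gradient_energy(toy_predictor, windows, p=3, L=2)
edges = top_k_edges(energy, k=2)
print(sorted(edges))
```

In this toy setup every target has exactly two parents, so Top-k with `k=2` returns the six true edges. When the number of parents varies across targets, a fixed `k` adds or drops edges, which is where a uniform-threshold rule may be the better choice.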

Figure 1: Data generation and transformer-based causal discovery. Left: A decoder-only transformer trained for next-step prediction. Tokens are lagged observations from $t\!-\!L$ to $t\!-\!1$; the model predicts $X_{t}$ from $X_{t-1:t-L}$. Right: A lagged data-generating process with $N\!=\!3$ and

Experimental Results

Research Questions

RQ1: Can decoder-only transformers trained for forecasting identify the true lagged causal graph under standard identifiability assumptions (A1–A4) in time-series data?
RQ2: Do gradient-based attributions (via LRP) recover causal structure more reliably than attention-based or traditional methods across nonlinear, long-range, and non-stationary dynamics?
RQ3: How do data volume, heterogeneity, and model depth affect causal discovery performance in this framework?
RQ4: Can the learned lagged causal structure be enhanced or refined when latent confounders or instantaneous effects are present?
RQ5: What practical strategies (domain indicators, post-processing pipelines) improve data efficiency and robustness of causal discovery with transformers?

Key Findings

Gradient-based decoding uniquely identifies the lagged causal parents under A1–A4 and regularity conditions.
Gradient-energy readouts via LRP effectively recover the true causal graph and outperform baselines in nonlinear, high-dimensional, and long-range settings.
Transformer-based discovery improves with more data and shows data-heterogeneity scaling, surpassing state-of-the-art methods in challenging regimes.
Contextualized attention enables multiple dynamic dependencies to be captured within a single model, accommodating non-stationarity without fixed static masks.
Post-processing with latent-aware or domain-enhanced steps can mitigate issues from latent confounders or instantaneous effects, improving robustness.
Deeper transformers and gradient-based readouts yield more accurate causal structures than shallow variants or attention-based explanations.

Figure 2: F1 score analysis across regimes. (A) Mean F1 across all experiments (averages exclude timeout cases). (B) High-dimensional input: F1 averaged across scales and seeds vs. the number of nodes. (C) Long-range dependencies: F1 averaged across scales and seeds vs. maximum lag. (D) Nonlinearity
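
The F1 scores in Figure 2 compare a recovered graph with the ground-truth graph. A minimal sketch of such an edge-level metric follows. The encoding of edges as `(source, target, lag)` triples and the function name `edge_f1` are assumptions for illustration, and the paper's exact evaluation protocol may differ.

```python
def edge_f1(predicted, truth):
    """F1 score between two sets of (source, target, lag) edges."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0  # also covers empty predictions or an empty ground truth
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)

# Illustrative edge sets (made up, not taken from the paper).
true_edges = {(0, 0, 1), (0, 1, 2), (1, 2, 1)}
pred_edges = {(0, 0, 1), (0, 1, 2), (2, 1, 1)}
score = edge_f1(pred_edges, true_edges)
print(round(score, 3))
```

Here two of the three predicted edges are correct and two of the three true edges are found, so precision and recall are both 2/3.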

This review was generated by AI and checked by a human editor.