QUICK REVIEW

[Paper Review] Non-stationary Transformers: Exploring the Stationarity in Time Series Forecasting

Yong Liu, Haixu Wu | arXiv (Cornell University) | May 28, 2022

Time Series Analysis and Forecasting | Cited by 261

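One-Sentence Summary

This paper proposes Non-stationary Transformers, which pair Series Stationarization with De-stationary Attention to balance series predictability against preserved non-stationarity, and reports state-of-the-art results on six real-world benchmarks across several Transformer variants.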

ABSTRACT

Transformers have shown great power in time series forecasting due to their global-range modeling ability. However, their performance can degenerate terribly on non-stationary real-world data in which the joint distribution changes over time. Previous studies primarily adopt stationarization to attenuate the non-stationarity of original series for better predictability. But the stationarized series deprived of inherent non-stationarity can be less instructive for real-world bursty events forecasting. This problem, termed over-stationarization in this paper, leads Transformers to generate indistinguishable temporal attentions for different series and impedes the predictive capability of deep models. To tackle the dilemma between series predictability and model capability, we propose Non-stationary Transformers as a generic framework with two interdependent modules: Series Stationarization and De-stationary Attention. Concretely, Series Stationarization unifies the statistics of each input and converts the output with restored statistics for better predictability. To address the over-stationarization problem, De-stationary Attention is devised to recover the intrinsic non-stationary information into temporal dependencies by approximating distinguishable attentions learned from raw series. Our Non-stationary Transformers framework consistently boosts mainstream Transformers by a large margin, which reduces MSE by 49.43% on Transformer, 47.34% on Informer, and 46.89% on Reformer, making them the state-of-the-art in time series forecasting. Code is available at this repository: https://github.com/thuml/Nonstationary_Transformers.

Motivation and Objectives

Argue that direct stationarization can cause over-stationarization, limiting Transformer capability on non-stationary data.
Introduce a generic framework combining Series Stationarization with De-stationary Attention.
Show that the framework boosts Transformer-based models across multiple real-world datasets.
Demonstrate strong empirical gains and broad compatibility with existing Transformer variants.

Proposed Method

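Series Stationarization normalizes each input series over the temporal dimension (within each sliding input window) using its own mean and standard deviation, then de-normalizes the model output with the same statistics to restore the original scale.
De-stationary Attention learns two de-stationary factors, a positive scaling scalar tau and a shifting vector Delta, from the raw (unnormalized) series and its statistics through an MLP projector, and uses them to re-incorporate non-stationarity into attention.
The queries, keys, and values Q', K', V' are computed from the stationarized series; tau rescales and Delta shifts the pre-softmax attention scores to approximate the attention that would be learned from the raw series (Equation 6).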
The framework wraps a base Transformer (Encoder-Decoder) and replaces standard Attention with De-stationary Attention, preserving efficiency.
The approach is compatible with Transformer variants (e.g., Transformer, Informer, Reformer, Autoformer) with minor modifications to attention terms (Appendix references).
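
The two modules described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, tensor shapes, and `eps` value are hypothetical, the attention form softmax((tau * Q'K'^T + 1 * Delta^T) / sqrt(d_k)) V' is our reading of the description of Equation 6, and tau and Delta are fixed placeholder values here rather than outputs of the learned MLP projector.

```python
import numpy as np

def stationarize(x, eps=1e-5):
    """Normalize each series over the time axis. x: (batch, length, channels)."""
    mu = x.mean(axis=1, keepdims=True)
    sigma = np.sqrt(x.var(axis=1, keepdims=True) + eps)
    return (x - mu) / sigma, mu, sigma

def de_stationarize(y, mu, sigma):
    """Restore the original per-series statistics on the model output."""
    return y * sigma + mu

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def de_stationary_attention(q, k, v, tau, delta):
    """q, k, v: (batch, length, d_k); tau: (batch, 1, 1); delta: (batch, 1, length).

    tau rescales the scores and delta adds one offset per key position,
    both applied before the softmax.
    """
    d_k = q.shape[-1]
    scores = (tau * (q @ k.transpose(0, 2, 1)) + delta) / np.sqrt(d_k)
    return softmax(scores) @ v

# Toy data: two series of length 8 with 4 channels (synthetic, for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(2, 8, 4))

x_norm, mu, sigma = stationarize(x)
q = k = v = x_norm                 # stand-in for learned projections of x_norm
tau = np.ones((2, 1, 1))           # placeholder; the paper learns this with an MLP
delta = np.zeros((2, 1, 8))        # placeholder; the paper learns this with an MLP

out = de_stationary_attention(q, k, v, tau, delta)
restored = de_stationarize(x_norm, mu, sigma)
print(out.shape)  # (2, 8, 4)
```

With `tau = 1` and `delta = 0` the function reduces to ordinary scaled dot-product attention, which is consistent with the claim that the framework only modifies the attention terms of the base model.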

Experimental Results

Research Questions

RQ1: Can stationarization improve short-term forecastability without losing essential non-stationary signals?
RQ2: Can a lightweight De-stationary Attention mechanism recover non-stationary information lost during stationarization?
RQ3: Do the proposed modules generalize across multiple Transformer architectures and real-world datasets?
RQ4: What is the empirical impact of the framework on non-stationary time series forecasting across diverse domains?

Key Findings

The framework consistently improves baseline Transformers on six real-world benchmarks across multiple forecast horizons.
The reported MSE reductions are substantial: roughly 49% for Transformer and roughly 47% for both Informer and Reformer.
Series Stationarization aligns statistical properties across input series, while De-stationary Attention reintroduces intrinsic non-stationarity to capture eventful temporal dependencies.
Across four mainstream Transformers, the framework yields large average gains, measured as average relative MSE reduction: Transformer 49.43%, Informer 47.34%, Reformer 46.89%, Autoformer 10.57%.
The De-stationary Attention component significantly mitigates over-stationarization, producing predictions closer to ground-truth non-stationary dynamics.
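
The percentages above read as relative MSE reductions. Assuming the usual definition, (MSE_base - MSE_new) / MSE_base, the calculation can be sketched as below. The helper names and all numbers are hypothetical toy values chosen for illustration; they are not results from the paper.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between two equally shaped sequences."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def relative_mse_reduction(mse_base, mse_new):
    """Percentage by which mse_new improves on mse_base."""
    return 100.0 * (mse_base - mse_new) / mse_base

# Synthetic targets and two sets of hypothetical forecasts.
y_true = [1.0, 2.0, 3.0, 4.0]
pred_base = [2.0, 3.0, 4.0, 5.0]   # every point off by 1.0
pred_new = [1.5, 2.5, 3.5, 4.5]    # every point off by 0.5

reduction = relative_mse_reduction(mse(y_true, pred_base), mse(y_true, pred_new))
print(reduction)  # 75.0
```

Because the metric is relative to the baseline error, a weaker baseline leaves more room for improvement, which is worth keeping in mind when comparing the gain on Autoformer with the gains on the other three models.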

This review was generated by AI and checked by a human editor.