[Paper Digest] Seeking SOTA: Time-Series Forecasting Must Adopt Taxonomy-Specific Evaluation to Dispel Illusory Gains
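This paper argues that current long-term time-series forecasting benchmarks emphasize periodic patterns, which simple baselines capture about as well as complex models do. It calls for taxonomy-specific evaluation and robust classical baselines so that reported results reveal real progress.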
We argue that the current practice of evaluating AI/ML time-series forecasting models, predominantly on benchmarks characterized by strong, persistent periodicities and seasonalities, obscures real progress by overlooking the performance of efficient classical methods. We demonstrate that these "standard" datasets often exhibit dominant autocorrelation patterns and seasonal cycles that can be effectively captured by simpler linear or statistical models, rendering complex deep learning architectures frequently no more performant than their classical counterparts for these specific data characteristics, and raising questions as to whether any marginal improvements justify the significant increase in computational overhead and model complexity. We call on the community to (I) retire or substantially augment current benchmarks with datasets exhibiting a wider spectrum of non-stationarities, such as structural breaks, time-varying volatility, and concept drift, and less predictable dynamics drawn from diverse real-world domains, and (II) require every deep learning submission to include robust classical and simple baselines, appropriately chosen for the specific characteristics of the downstream tasks' time series. By doing so, we will help ensure that reported gains reflect genuine scientific methodological advances rather than artifacts of benchmark selection favoring models adept at learning repetitive patterns.
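Research Motivation and Goals
- Show that standard long-term time-series forecasting (LTSF) benchmarks are dominated by strong periodicities that simple models can capture.
- Argue for retiring current benchmarks, or augmenting them with datasets from diverse domains that exhibit non-stationarity and non-periodic dynamics.
- Advocate that every deep learning TSF submission include robust classical baselines chosen to match the characteristics of the data.
- Call for evaluation protocols whose reported gains reflect genuine methodological progress rather than benchmark artifacts.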
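Proposed Approach
- Review and critique existing LTSF benchmark datasets and evaluation practices.
- Analyze how strong periodicity lets simple or lightweight models rival or surpass complex transformers.
- Discuss statistical principles (such as Stein's paradox) to explain aggregation effects across heterogeneous time series.
- Propose taxonomy-specific evaluation and the routine inclusion of inexpensive classical baselines in submissions.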
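To make the periodicity argument concrete, here is a minimal sketch of a seasonal-naive baseline, one of the inexpensive classical baselines the paper asks submissions to include. Everything in it is an illustrative assumption rather than the paper's experimental setup: the synthetic sine series, the period of 24, the horizon of 96, and the helper names `make_periodic_series`, `seasonal_naive`, and `last_value_naive` are all hypothetical.

```python
import math
import random


def make_periodic_series(n, period, noise_std, seed=0):
    """Synthetic series (assumption): one sine cycle per `period` steps plus Gaussian noise."""
    rng = random.Random(seed)
    return [math.sin(2 * math.pi * t / period) + rng.gauss(0.0, noise_std)
            for t in range(n)]


def seasonal_naive(history, horizon, period):
    """Forecast each step by repeating the last fully observed cycle."""
    last_cycle = history[-period:]
    return [last_cycle[h % period] for h in range(horizon)]


def last_value_naive(history, horizon):
    """Forecast every step with the most recent observation."""
    return [history[-1]] * horizon


def mse(actual, predicted):
    """Mean squared error over paired values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# 30 synthetic "days" of hourly data; hold out the last 96 steps for testing.
series = make_periodic_series(n=24 * 30, period=24, noise_std=0.1)
horizon = 96
train, test = series[:-horizon], series[-horizon:]

mse_seasonal = mse(test, seasonal_naive(train, horizon, period=24))
mse_last = mse(test, last_value_naive(train, horizon))
print(f"seasonal naive MSE:   {mse_seasonal:.4f}")
print(f"last-value naive MSE: {mse_last:.4f}")
```

On a series this periodic, repeating the last cycle leaves little beyond the noise as error, which is why a benchmark dominated by such data says little about whether a complex model has learned anything harder. A real comparison would swap in the benchmark's own data and the model under review.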
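Experimental Results
Research Questions
- RQ1: Do current LTSF benchmarks reflect real-world non-stationarity and test model generalization beyond periodic data?
- RQ2: Can simple classical models compete with state-of-the-art deep learning methods on current LTSF benchmarks across domains?
- RQ3: Which evaluation practices ensure that reported gains represent genuine methodological progress rather than benchmark artifacts?
- RQ4: How should TSF benchmarks be redesigned to incorporate taxonomy-specific non-stationarities and set fair baselines?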
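Main Findings
- Standard LTSF datasets show strong, persistent periodicities that linear or statistical models capture effectively.
- On these datasets, many deep learning TSF models improve only marginally on classical baselines, which calls the value of the added complexity into question.
- Simple models such as LTSF-Linear can outperform state-of-the-art transformers on nine standard LTSF datasets, highlighting a benchmark-driven illusion of progress.
- Metrics aggregated over heterogeneous series can favor global generalists over task specialists and mask poor performance on individual tasks.
- Benchmarks need to cover a wider range of non-stationarities, such as structural breaks, time-varying volatility, and concept drift, and submissions should be required to include robust baselines.
- Evaluation metrics must be chosen carefully: report several metrics and stay aware of their limitations to avoid biased rankings.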
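The findings on aggregation and metric choice can be illustrated with a small sketch that scores two forecasters both with a pooled MSE and with a per-series scaled error (MASE, which divides the forecast MAE by the in-sample MAE of a naive forecast). All numbers below are hypothetical toy values chosen for illustration, not results from the paper, and the names `model_a`, `model_b`, `high_scale`, and `low_scale` are placeholders.

```python
def mae(actual, predicted):
    """Mean absolute error over paired values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error over paired values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mase(actual, predicted, train, period=1):
    """Forecast MAE divided by the in-sample MAE of a lag-`period` naive forecast."""
    naive_errors = [abs(train[t] - train[t - period])
                    for t in range(period, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    return mae(actual, predicted) / scale


# Hypothetical toy data: two series whose scales differ by about 100x.
series = {
    "high_scale": {"train": [100, 120, 100, 120, 100, 120], "actual": [100, 120]},
    "low_scale": {"train": [1.0, 1.2, 0.9, 1.1, 1.0, 1.3], "actual": [1.1, 0.9]},
}
# Hypothetical forecasts from two placeholder models.
forecasts = {
    "model_a": {"high_scale": [102, 118], "low_scale": [2.1, 1.9]},
    "model_b": {"high_scale": [106, 114], "low_scale": [1.1, 1.0]},
}

pooled_mse = {}
per_series_mase = {}
for model, preds in forecasts.items():
    actual_all, pred_all = [], []
    per_series_mase[model] = {}
    for name, data in series.items():
        actual_all += data["actual"]
        pred_all += preds[name]
        per_series_mase[model][name] = mase(data["actual"], preds[name], data["train"])
    # Pooling raw errors lets the high-scale series dominate the score.
    pooled_mse[model] = mse(actual_all, pred_all)

mean_mase = {m: sum(v.values()) / len(v) for m, v in per_series_mase.items()}

for model in forecasts:
    print(model, "pooled MSE:", round(pooled_mse[model], 3),
          "| per-series MASE:", {k: round(v, 3) for k, v in per_series_mase[model].items()},
          "| mean MASE:", round(mean_mase[model], 3))
```

In this toy setup the pooled MSE ranks `model_a` first because the high-scale series dominates the average, while the per-series MASE shows `model_a` doing worse than a naive forecast on the low-scale series. Reporting both views is one way to surface the ranking bias that a single aggregate number hides.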
This digest was generated by AI and reviewed by a human editor.