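[Paper Walkthrough] Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts
Time-MoE introduces a decoder-only, sparse Mixture-of-Experts (MoE) time series foundation model. Trained on Time-300B, it targets general-purpose forecasting over flexible horizons at lower inference cost, and scales to 2.4B parameters.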
Deep learning for time series forecasting has seen significant advancements over the past decades. However, despite the success of large-scale pre-training in language and vision domains, pre-trained time series models remain limited in scale and operate at a high cost, hindering the development of larger, more capable forecasting models in real-world applications. In response, we introduce Time-MoE, a scalable and unified architecture designed to pre-train larger, more capable forecasting foundation models while reducing inference costs. By leveraging a sparse mixture-of-experts (MoE) design, Time-MoE enhances computational efficiency by activating only a subset of networks for each prediction, reducing computational load while maintaining high model capacity. This allows Time-MoE to scale effectively without a corresponding increase in inference costs. Time-MoE comprises a family of decoder-only transformer models that operate in an auto-regressive manner and support flexible forecasting horizons with varying input context lengths. We pre-trained these models on our newly introduced large-scale dataset Time-300B, which spans 9 domains and encompasses over 300 billion time points. For the first time, we scaled a time series foundation model up to 2.4 billion parameters, achieving significantly improved forecasting precision. Our results validate the applicability of scaling laws for training tokens and model size in the context of time series forecasting. Compared to dense models with the same number of activated parameters or equivalent computation budgets, our models consistently outperform them by a large margin. These advancements position Time-MoE as a state-of-the-art solution for tackling real-world time series forecasting challenges with superior capability, efficiency, and flexibility.
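Motivation and Objectives
- Advance scalable, general-purpose time series foundation models that balance forecasting accuracy against computational cost.
- Propose a sparse Mixture-of-Experts (MoE) transformer architecture for time series forecasting.
- Build a large, high-quality pre-training dataset that covers multiple domains (Time-300B).
- Demonstrate the gains from scaling model and data size on zero-shot and in-distribution benchmarks.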
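Proposed Method
- Propose a decoder-only Time-MoE architecture made up of input token embeddings, sparse MoE transformer blocks, and multi-resolution forecasting heads.
- Replace the feed-forward layer in each block with an MoE layer: a pool of experts routed by top-k gating, plus one shared expert that is always active, which raises capacity without a matching rise in compute.
- Use rotary position embeddings and RMSNorm for training stability and extrapolation.
- Pre-train Time-MoE on Time-300B (over 300B time points across 9 domains) with a multi-task objective that covers the multi-resolution forecasts, plus an auxiliary expert-balance loss.
- Train Time-MoE Ultra (2.4B total parameters, about 1B activated) and two smaller variants (base with 50M and large with 200M activated parameters) for 100k steps on 128 A100 GPUs in BF16 precision.
- Optimize the auto-regressive forecasts with a Huber loss, add the auxiliary balance loss to mitigate routing collapse, and apply greedy scheduling over the multi-resolution heads at inference time.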
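The routing step described in the method list can be illustrated with a small pure-Python sketch. It assumes a softmax router whose top-k probabilities weight the routed experts without renormalization, and a sigmoid gate on the shared expert. The names `moe_layer`, `router_logits`, and `shared_gate_logit` are hypothetical, and the experts are arbitrary callables standing in for feed-forward networks; this is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(token, router_logits, experts, shared_expert, shared_gate_logit, k=2):
    """Combine the top-k routed experts with one always-on shared expert.

    token             : list of floats (one hidden vector)
    router_logits     : one score per routed expert (assumed given by a router)
    experts           : list of callables, each mapping a vector to a vector
    shared_expert     : callable applied to every token
    shared_gate_logit : scalar logit for the shared expert's sigmoid gate
    """
    probs = softmax(router_logits)
    # Indices of the k experts with the highest routing probability.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]

    out = [0.0] * len(token)
    # Only the chosen experts are evaluated; this is where the compute saving comes from.
    for i in chosen:
        y = experts[i](token)
        out = [o + probs[i] * v for o, v in zip(out, y)]

    # The shared expert runs for every token, scaled by a sigmoid gate.
    gate = 1.0 / (1.0 + math.exp(-shared_gate_logit))
    y = shared_expert(token)
    out = [o + gate * v for o, v in zip(out, y)]
    return out, chosen
```

With four routed experts and `k=2`, each token triggers two routed experts plus the shared one, so the activated parameter count stays well below the total even as more experts are added.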

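Greedy scheduling over multi-resolution heads can be sketched as follows: to produce a forecast of arbitrary length, repeatedly pick the largest head horizon that still fits in the remaining length, then continue auto-regressively. The function name `greedy_schedule` and the default horizons `(1, 8, 32, 64)` are assumptions for illustration, not values confirmed by this summary.

```python
def greedy_schedule(target_len, head_horizons=(1, 8, 32, 64)):
    """Return the sequence of head horizons used to cover target_len steps.

    At every auto-regressive step the largest horizon that does not
    overshoot the remaining length is chosen. A horizon of 1 must be
    available so that any target length can be reached exactly.
    """
    if target_len < 0:
        raise ValueError("target_len must be non-negative")
    horizons = sorted(set(head_horizons), reverse=True)
    if not horizons or horizons[-1] != 1:
        raise ValueError("head_horizons must include a horizon of 1")

    steps = []
    remaining = target_len
    while remaining > 0:
        # Largest head that fits; guaranteed to exist because 1 is included.
        step = next(h for h in horizons if h <= remaining)
        steps.append(step)
        remaining -= step
    return steps
```

Under these assumed horizons, a 96-step forecast takes two forward passes (64, then 32) instead of 96 single-step passes, which is the kind of saving the multi-resolution heads are meant to provide.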
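Experimental Results
Research Questions
- RQ1: Under a fixed inference budget, can Time-MoE scale a time series foundation model to the billion-parameter level while maintaining or improving forecasting accuracy?
- RQ2: Across benchmarks, do sparse MoE time series models outperform dense counterparts with a similar number of activated parameters or a similar compute budget?
- RQ3: Does large-scale pre-training on Time-300B yield zero-shot and in-distribution gains across domains and forecasting horizons?
- RQ4: How do the multi-resolution forecasting heads and flexible context lengths affect general-purpose forecasting ability?
- RQ5: Which data quality and cleaning strategies are essential for stable training of billion-parameter time series models?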
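Key Findings
- With the same number of activated parameters or the same compute budget, Time-MoE is markedly more accurate than dense baselines.
- Scaling the model from base to ultra yields consistent improvements across benchmarks in the zero-shot setting.
- In zero-shot and in-distribution evaluations on six real-world benchmarks, Time-MoE models outperform 16 strong baselines, lowering average MSE by about 20% (zero-shot) and 24% (in-distribution).
- Time-MoE scales to 2.4B parameters (about 1B activated) and stays efficient at inference because of sparse routing.
- Time-300B provides a large, open-access, cross-domain pre-training corpus (over 300B time points across 9 domains), together with a data-cleaning pipeline that makes large-scale time series pre-training feasible.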

This walkthrough was generated by AI and reviewed by human editors.