QUICK REVIEW

[Paper Review] Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts

Xiaoming Shi, Shiyu Wang|arXiv (Cornell University)|Sep 24, 2024

Complex Systems and Time Series Analysis|Cited by 8

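One-Sentence Summary

Time-MoE introduces a family of decoder-only, sparse Mixture-of-Experts time series foundation models pre-trained on Time-300B. It targets universal forecasting with flexible forecast horizons and reduced inference cost, and scales to 2.4B parameters.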

ABSTRACT

Deep learning for time series forecasting has seen significant advancements over the past decades. However, despite the success of large-scale pre-training in language and vision domains, pre-trained time series models remain limited in scale and operate at a high cost, hindering the development of larger capable forecasting models in real-world applications. In response, we introduce Time-MoE, a scalable and unified architecture designed to pre-train larger, more capable forecasting foundation models while reducing inference costs. By leveraging a sparse mixture-of-experts (MoE) design, Time-MoE enhances computational efficiency by activating only a subset of networks for each prediction, reducing computational load while maintaining high model capacity. This allows Time-MoE to scale effectively without a corresponding increase in inference costs. Time-MoE comprises a family of decoder-only transformer models that operate in an auto-regressive manner and support flexible forecasting horizons with varying input context lengths. We pre-trained these models on our newly introduced large-scale data Time-300B, which spans over 9 domains and encompassing over 300 billion time points. For the first time, we scaled a time series foundation model up to 2.4 billion parameters, achieving significantly improved forecasting precision. Our results validate the applicability of scaling laws for training tokens and model size in the context of time series forecasting. Compared to dense models with the same number of activated parameters or equivalent computation budgets, our models consistently outperform them by large margin. These advancements position Time-MoE as a state-of-the-art solution for tackling real-world time series forecasting challenges with superior capability, efficiency, and flexibility.

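Motivation and Objectives

Motivate scalable, universal time series foundation models that balance forecasting accuracy against computational cost.
Propose a sparse Mixture-of-Experts (MoE) transformer architecture for time series forecasting.
Build a large-scale, high-quality pre-training dataset spanning multiple domains (Time-300B).
Demonstrate the benefits of model and data scale on zero-shot and in-distribution benchmarks.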

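Proposed Method

Propose Time-MoE, a decoder-only architecture comprising input token embedding, sparse MoE transformer blocks, and multi-resolution forecasting heads.
Replace each FFN layer with a pool of experts routed by top-k gating, plus a shared expert, to raise capacity without a matching rise in per-token compute.
Use rotary positional encodings and RMSNorm for training stability and length extrapolation.
Pre-train on Time-300B (300B time points across 9 domains) with a multi-task objective that combines multi-resolution forecasting with an auxiliary expert-balancing loss.
Train Time-MoE ultra (2.4B total parameters, about 1B activated) and the smaller base (50M) and large (200M) variants for 100k steps in BF16 on 128 A100 GPUs.
Optimize the autoregressive forecasts with a Huber loss, adding the auxiliary balance loss to mitigate routing collapse; at inference, apply a greedy schedule over the multi-resolution heads to produce the forecast.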
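The greedy schedule mentioned in the last item can be sketched as follows. This is a minimal illustration, not the authors' implementation. The head sizes (1, 8, 32, 64) and the names `greedy_schedule`, `forecast`, and `predict` are assumptions made for the example. The idea is that each forecasting head emits a fixed number of future points, and at every autoregressive step the largest head that does not overshoot the remaining horizon is used.

```python
from typing import Callable, List, Sequence


def greedy_schedule(horizon: int, head_sizes: Sequence[int]) -> List[int]:
    """Return the head sizes used, in order, to cover `horizon` steps.

    At each step the largest head that fits into the remaining horizon is
    chosen. A head of size 1 must exist so that every horizon is reachable.
    """
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    sizes = sorted(set(head_sizes), reverse=True)
    if not sizes or sizes[-1] != 1:
        raise ValueError("head_sizes must include 1")
    plan: List[int] = []
    remaining = horizon
    while remaining > 0:
        step = next(s for s in sizes if s <= remaining)
        plan.append(step)
        remaining -= step
    return plan


def forecast(
    context: Sequence[float],
    horizon: int,
    predict: Callable[[List[float], int], List[float]],
    head_sizes: Sequence[int] = (1, 8, 32, 64),  # assumed values
) -> List[float]:
    """Autoregressively extend `context` by exactly `horizon` points.

    `predict(history, size)` is a hypothetical stand-in for one model call
    that returns `size` future values from the head of that size.
    """
    history = list(context)
    for size in greedy_schedule(horizon, head_sizes):
        history.extend(predict(history, size))
    return history[len(context):]
```

Under these assumed head sizes, a horizon of 100 is covered by one 64-step call, one 32-step call, and four 1-step calls, so larger heads reduce the number of autoregressive passes whenever the horizon allows it.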

Figure 1: Performance overview. (Left) Comparison between Time-MoE models and state-of-the-art time series foundation models, reporting the average zero-shot performance across six benchmark datasets. (Right) Comparison of few- and zero-shot performance between Time-MoE and dense variants, with

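Experimental Results

Research Questions

RQ1: Can Time-MoE scale to billions of parameters and maintain or improve forecasting accuracy under a fixed inference budget?
RQ2: Do sparse MoE time series models outperform dense models with the same number of activated parameters or the same compute budget across benchmarks?
RQ3: Does large-scale pre-training on Time-300B yield zero-shot and in-distribution gains across domains and horizons?
RQ4: How do the multi-resolution forecasting heads and flexible context lengths affect universal forecasting capability?
RQ5: Which data quality and cleaning strategies are essential for stable training at the billion-parameter scale?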

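Key Findings

Time-MoE achieves substantially better forecasting accuracy than dense baselines with the same number of activated parameters or the same compute budget.
Scaling the model from base to ultra yields consistent improvements on zero-shot benchmarks.
Time-MoE outperforms 16 strong baselines on 6 real-world benchmarks in both zero-shot and in-distribution evaluation, reducing average MSE by about 20% (zero-shot) and about 24% (in-distribution).
Time-MoE scales to 2.4B parameters (about 1B activated) while sparse routing keeps inference efficient.
Time-300B provides a large, open-access, cross-domain pre-training corpus with more than 300B time points across 9 domains, together with a data-cleaning pipeline that makes large-scale time series pre-training feasible.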
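The efficiency findings rest on sparse routing: only a few experts run for each token, so activated parameters stay well below total parameters. The sketch below illustrates top-k gating with one always-on shared expert in plain Python. It is a simplified illustration under stated assumptions, not the paper's layer: the names `softmax`, `top_k_route`, and `moe_forward` are hypothetical, the gate weights are not renormalized after the top-k selection, and the shared expert is added with weight 1.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of gate logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert_index, gate_weight) for the k highest-scoring experts."""
    probs = softmax(gate_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:k]]


def moe_forward(
    x: Vector,
    gate_logits: List[float],
    experts: List[Expert],
    shared_expert: Expert,
    k: int = 2,
) -> Tuple[Vector, List[int]]:
    """Combine the shared expert with the k routed experts for one token.

    Only the selected experts are evaluated; the others are never called,
    which is where the compute saving comes from.
    """
    routes = top_k_route(gate_logits, k)
    out = list(shared_expert(x))  # assumption: shared expert has weight 1
    for idx, weight in routes:
        y = experts[idx](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out, [i for i, _ in routes]
```

With 8 routed experts and k = 2, each token evaluates 2 routed experts plus the shared one, regardless of how many experts the pool holds. Per-token cost therefore tracks the activated parameters rather than the total.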

Figure 2: The architecture of Time-MoE, which is a decoder-only model. Given an input time series of arbitrary length, (1) we first tokenize it into a sequence of data points, (2) which are then encoded. These tokens are processed through $N$-stacked backbone layers, primarily consisting of causal mul

This review was generated by AI and checked by a human editor.