QUICK REVIEW

[Paper Review] Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts

Xiaoming Shi, Shiyu Wang|arXiv (Cornell University)|2024. 09. 24.

Complex Systems and Time Series Analysis|Citations: 8

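One-Line Summary

Time-MoE trains a decoder-only, sparse Mixture-of-Experts time series foundation model on Time-300B, lowering the inference cost of flexible forecasting while delivering universal forecasting that scales up to 2.4B parameters.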

ABSTRACT

Deep learning for time series forecasting has seen significant advancements over the past decades. However, despite the success of large-scale pre-training in language and vision domains, pre-trained time series models remain limited in scale and operate at a high cost, hindering the development of larger capable forecasting models in real-world applications. In response, we introduce Time-MoE, a scalable and unified architecture designed to pre-train larger, more capable forecasting foundation models while reducing inference costs. By leveraging a sparse mixture-of-experts (MoE) design, Time-MoE enhances computational efficiency by activating only a subset of networks for each prediction, reducing computational load while maintaining high model capacity. This allows Time-MoE to scale effectively without a corresponding increase in inference costs. Time-MoE comprises a family of decoder-only transformer models that operate in an auto-regressive manner and support flexible forecasting horizons with varying input context lengths. We pre-trained these models on our newly introduced large-scale data Time-300B, which spans over 9 domains and encompassing over 300 billion time points. For the first time, we scaled a time series foundation model up to 2.4 billion parameters, achieving significantly improved forecasting precision. Our results validate the applicability of scaling laws for training tokens and model size in the context of time series forecasting. Compared to dense models with the same number of activated parameters or equivalent computation budgets, our models consistently outperform them by large margin. These advancements position Time-MoE as a state-of-the-art solution for tackling real-world time series forecasting challenges with superior capability, efficiency, and flexibility.

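Motivation and Objectives

Motivates a scalable, universal time series foundation model that balances forecasting accuracy against computational efficiency.
Proposes a sparse Mixture-of-Experts (MoE) transformer architecture for time series forecasting.
Builds a large-scale, high-quality pre-training dataset (Time-300B) spanning diverse domains.
Demonstrates the benefits of model and data scale on zero-shot and in-distribution benchmarks.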

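Proposed Method

Proposes the decoder-only Time-MoE architecture, composed of input token embedding, sparse MoE transformer blocks, and a multi-resolution forecasting head.
Replaces each FFN layer with a pool of experts routed by top-k gating, plus one shared expert, to improve efficiency and capacity.
Uses rotary positional embeddings and RMSNorm for training stability and extrapolation.
Pre-trains on Time-300B (over 300B time points across 9 domains) with a multi-task objective over multi-resolution forecasts and an auxiliary expert-balancing loss.
Trains Time-MoE ultra (2.4B total parameters, about 1B activated) and smaller variants (base with 50M and large with 200M activated parameters) for 100k steps in BF16 on 128 A100 GPUs.
Optimizes a Huber loss for auto-regressive forecasting together with an auxiliary balance loss that mitigates routing collapse; applies greedy scheduling across the multi-resolution forecasts during inference.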
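
The routing and loss terms listed above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the class and function names (`SparseMoELayer`, `balance_loss`, `huber_loss`) are hypothetical, the sigmoid gate on the shared expert and the un-renormalized softmax weights are assumptions, and the balance loss follows the common Switch-Transformer-style form (number of experts times the sum of routed-token fraction times mean router probability), which may differ in detail from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class Expert:
    """Tiny ReLU feed-forward network standing in for one FFN expert."""

    def __init__(self, d_model, d_hidden, rng):
        self.w_in = random_matrix(d_hidden, d_model, rng)
        self.w_out = random_matrix(d_model, d_hidden, rng)

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.w_in, x)]
        return matvec(self.w_out, hidden)


class SparseMoELayer:
    """Top-k routed experts plus one always-active shared expert."""

    def __init__(self, d_model, d_hidden, n_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.experts = [Expert(d_model, d_hidden, rng) for _ in range(n_experts)]
        self.shared = Expert(d_model, d_hidden, rng)
        self.router = random_matrix(n_experts, d_model, rng)
        # Assumption: the shared expert is scaled by a sigmoid gate.
        self.shared_gate = [rng.gauss(0.0, 0.1) for _ in range(d_model)]

    def forward(self, x):
        """Return (output, routing probabilities, chosen expert ids) for one token."""
        probs = softmax(matvec(self.router, x))
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[: self.top_k]
        out = [0.0] * len(x)
        for i in chosen:  # only the selected experts are evaluated
            out = [o + probs[i] * y for o, y in zip(out, self.experts[i](x))]
        logit = sum(g * v for g, v in zip(self.shared_gate, x))
        gate = 1.0 / (1.0 + math.exp(-logit))
        out = [o + gate * y for o, y in zip(out, self.shared(x))]
        return out, probs, chosen


def balance_loss(all_probs, all_chosen, n_experts, top_k):
    """Switch-style auxiliary loss: n_experts * sum_i f_i * r_i."""
    n_tokens = len(all_probs)
    loss = 0.0
    for i in range(n_experts):
        # f_i: share of routing slots given to expert i.
        f_i = sum(i in chosen for chosen in all_chosen) / (top_k * n_tokens)
        # r_i: mean router probability assigned to expert i.
        r_i = sum(p[i] for p in all_probs) / n_tokens
        loss += f_i * r_i
    return n_experts * loss


def huber_loss(preds, targets, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(preds, targets):
        err = abs(p - t)
        total += 0.5 * err * err if err <= delta else delta * (err - 0.5 * delta)
    return total / len(preds)


if __name__ == "__main__":
    layer = SparseMoELayer(d_model=8, d_hidden=16, n_experts=4, top_k=2)
    rng = random.Random(1)
    tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(32)]
    results = [layer.forward(t) for t in tokens]
    aux = balance_loss([r[1] for r in results], [r[2] for r in results], 4, 2)
    print("auxiliary balance loss:", aux)
```

In this form the balance term equals 1.0 when tokens and router probability are spread evenly across experts, and rises to the number of experts when every token collapses onto a single one, which is why adding it to the Huber forecasting loss discourages routing collapse.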

Figure 1: Performance overview. (Left) Comparison between Time-MoE models and state-of-the-art time series foundation models, reporting the average zero-shot performance across six benchmark datasets. (Right) Comparison of few- and zero-shot performance between Time-MoE and dense variants, with

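Experimental Results

Research Questions

RQ1: Can Time-MoE scale to billions of parameters while maintaining or improving forecasting accuracy under a fixed inference budget?
RQ2: Do sparse MoE time series models outperform their dense counterparts at a similar activated-parameter count or computation budget?
RQ3: Does large-scale pre-training on Time-300B yield zero-shot and in-distribution gains across diverse domains and forecast horizons?
RQ4: How do the multi-resolution forecasting head and flexible context lengths affect universal forecasting capability?
RQ5: What data quality and cleaning strategies are needed to train billion-parameter time series models stably?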

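Key Findings

Time-MoE achieves substantially better forecasting accuracy than dense foundation models with the same number of activated parameters or the same computation budget.
Scaling the model from base to ultra yields consistent improvements across benchmarks in the zero-shot setting.
Across six real-world benchmarks, Time-MoE models outperform 16 strong competing models in zero-shot and in-distribution evaluation, with average MSE reductions of about 20% (zero-shot) and 24% (in-distribution).
Time-MoE scales to 2.4B parameters (about 1B activated), and sparse routing keeps inference efficient.
Time-300B is a large-scale, openly accessible, cross-domain pre-training corpus (over 300B time points; 9 domains) whose data-cleaning pipeline enables time series pre-training at scale.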

Figure 2: The architecture of Time-MoE, which is a decoder-only model. Given an input time series of arbitrary length, (1) we first tokenize it into a sequence of data points, (2) which are then encoded. These tokens are processed through $N$-stacked backbone layers, primarily consisting of causal mul


This review was generated by AI and reviewed by a human editor.