QUICK REVIEW

[Paper Review] Scalable and Efficient MoE Training for Multitask Multilingual Models

Young Jin Kim, Ammar Ahmad Awan | arXiv (Cornell University) | 2021-09-22
Topic Modeling | References: 26 | Citations: 33
One-Line Summary

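This paper presents DeepSpeed MoE, a system for scalable training of large Mixture-of-Experts models in a multitask multilingual setting, and uses training techniques such as RTS, AoE, and expert pruning to report strong MT and multilingual generation results with Z-code M3 models of up to 10B parameters.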

ABSTRACT

The Mixture of Experts (MoE) models are an emerging class of sparsely activated deep learning models that have sublinear compute costs with respect to their parameters. In contrast with dense models, the sparse architecture of MoE offers opportunities for drastically growing model size with significant accuracy gain while consuming much lower compute budget. However, supporting large scale MoE training also has its own set of system and modeling challenges. To overcome the challenges and embrace the opportunities of MoE, we first develop a system capable of scaling MoE models efficiently to trillions of parameters. It combines multi-dimensional parallelism and heterogeneous memory technologies harmoniously with MoE to empower 8x larger models on the same hardware compared with existing work. Besides boosting system efficiency, we also present new training methods to improve MoE sample efficiency and leverage expert pruning strategy to improve inference time efficiency. By combining the efficient system and training methods, we are able to significantly scale up large multitask multilingual models for language generation which results in a great improvement in model accuracy. A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks. The system support of efficient MoE training has been implemented and open-sourced with the DeepSpeed library.
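
The abstract's claim of sublinear compute cost can be illustrated with simple parameter arithmetic. The sketch below is not from the paper: the function name `moe_ffn_params`, the layer sizes, and the simplified layer layout (two weight matrices per expert, one router matrix, no biases) are assumptions chosen only for illustration. It shows that total parameters grow with the number of experts while the parameters touched per token stay almost flat under top-1 routing.

```python
def moe_ffn_params(d_model, d_ff, num_experts, top_k=1):
    """Return (total, active-per-token) parameter counts for one MoE feed-forward layer.

    Assumes each expert is a two-matrix feed-forward block (biases ignored)
    and the router is a single d_model x num_experts matrix.
    """
    per_expert = 2 * d_model * d_ff          # up-projection + down-projection
    router = d_model * num_experts           # gating weights
    total = num_experts * per_expert + router
    active = top_k * per_expert + router     # only top_k experts run per token
    return total, active


# Hypothetical sizes, chosen only for illustration (not taken from the paper).
for n in (1, 8, 64):
    total, active = moe_ffn_params(d_model=1024, d_ff=4096, num_experts=n)
    print(f"experts={n:3d}  total={total:,}  active per token={active:,}")
```

Under these assumptions, going from 1 to 64 experts multiplies the layer's parameter count by roughly 64 while the per-token work grows only by the slightly larger router.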

Research Motivation and Goals

  • Enable scalable training of multitask multilingual Mixture-of-Experts models reaching billions/trillions of parameters.
  • Develop training and system techniques to improve sample efficiency and runtime efficiency.
  • Showcase practical models (Z-code M3) achieving strong MT and multilingual generation across 50 languages.

Proposed Method

  • Develop the DeepSpeed MoE system, which supports five forms of parallelism built from data, expert, and model (tensor-slicing) parallelism together with ZeRO, and uses ZeRO-Offload to exceed GPU memory limits.
  • Introduce Random Token Selection (RTS) to mitigate token-position bias in MoE routing.
  • Propose Aggregation of Experts (AoE) combining checkpoints to create larger expert pools for initialization and training.
  • Explore expert pruning strategies (random and utilization-based) for faster inference.
  • Train multitask multilingual models (MT, DAE, ELECTRA, MLM) within a single objective by summing task losses.
  • Use transformer encoder-decoder architecture with MoE layers placed every other layer and a 250k SentencePiece vocabulary.
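
The token-position bias that RTS targets can be sketched as follows. This is one reading of the bullet above, not the paper's implementation: the function `assign_with_capacity`, the toy routing decisions, and the capacity value are all hypothetical. The assumption is that each expert has a fixed capacity and that, without RTS, tokens claim expert slots in sequence order, so overflow always drops tokens near the end of the sequence. Shuffling the claiming order spreads the dropped tokens across positions.

```python
import random


def assign_with_capacity(expert_choice, capacity, num_experts, rng=None):
    """Assign tokens to their chosen expert until that expert is full.

    expert_choice[i] is the expert picked by the router for token i.
    If rng is None, tokens claim slots in sequence order (prefix-biased);
    otherwise the claiming order is shuffled first (random token selection).
    Returns the sorted positions of the tokens that were kept.
    """
    order = list(range(len(expert_choice)))
    if rng is not None:
        rng.shuffle(order)                   # randomize who gets a slot first
    load = [0] * num_experts
    kept = []
    for i in order:
        e = expert_choice[i]
        if load[e] < capacity:               # expert still has room
            load[e] += 1
            kept.append(i)
    return sorted(kept)


# Toy example: 8 tokens all routed to expert 0, which has room for 4.
choices = [0] * 8
print(assign_with_capacity(choices, capacity=4, num_experts=2))  # keeps positions 0..3
print(assign_with_capacity(choices, capacity=4, num_experts=2, rng=random.Random(0)))
```

In the sequence-order case the kept tokens are always the first four positions. With shuffling, the same number of tokens is kept, but which positions survive varies from batch to batch.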

Experimental Results

Research Questions

  • RQ1: How can MoE architectures be scaled to trillions of parameters for multitask multilingual training?
  • RQ2: Can DeepSpeed MoE overcome GPU memory limits to enable larger base models and more experts?
  • RQ3: Do MoE-based multitask multilingual training regimes improve downstream MT and NLG tasks compared to dense baselines?
  • RQ4: What training techniques maximize sample efficiency and inference efficiency in large MoE models?
  • RQ5: What is the impact of multitask objectives on multilingual translation and generation quality?

Key Findings

  • DeepSpeed MoE enables near-linear throughput scaling across GPUs and supports model sizes beyond GPU memory via ZeRO-Offload.
  • RTS reduces token-position bias and improves convergence speed and regularization in MoE training.
  • AoE allows creating larger effective expert pools by aggregating parameters from checkpoints to initialize larger models.
  • Expert pruning yields smaller models with more efficient inference without sacrificing much performance, using either random or utilization-based expert selection.
  • Z-code M3 (10B parameters, 64 experts) outperforms dense baselines and smaller MoE configurations on MT and multilingual tasks, demonstrating strong multitask multilingual capabilities.
  • Fine-tuned Z-code M3 models achieve notable improvements on downstream tasks such as WikiLingua and cross-lingual generation.
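
The two pruning strategies named in the findings above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' procedure: `select_experts`, the routing log, and the tie-breaking rule (lower expert id wins) are hypothetical. Utilization-based selection is assumed here to mean keeping the experts that received the most tokens on some held-out data.

```python
import random
from collections import Counter


def select_experts(routed_expert_ids, num_experts, keep, strategy="utilization", seed=0):
    """Choose which experts to retain when pruning an MoE layer.

    routed_expert_ids: one expert id per routed token (a hypothetical routing log).
    strategy="utilization": keep the `keep` most frequently used experts.
    strategy="random": keep a uniformly random subset of size `keep`.
    Returns the retained expert ids in ascending order.
    """
    if strategy == "random":
        return sorted(random.Random(seed).sample(range(num_experts), keep))
    counts = Counter(routed_expert_ids)      # missing experts count as 0
    # Rank by descending usage; break ties by lower expert id.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    return sorted(ranked[:keep])


# Toy routing log: experts 0 and 2 each receive 4 tokens, experts 1 and 3 one each.
routing_log = [0, 0, 0, 1, 2, 2, 2, 2, 3, 0]
print(select_experts(routing_log, num_experts=4, keep=2))  # [0, 2]
print(select_experts(routing_log, num_experts=4, keep=2, strategy="random"))
```

After selection, the retained experts' weights and the matching router columns would be copied into a smaller layer; that step depends on the model code and is left out here.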


This review was generated by AI and reviewed by a human editor.