QUICK REVIEW

[Paper Review] Scalable Training of Mixture-of-Experts Models with Megatron Core

Zijie Yan, Hongxiao Bai|arXiv (Cornell University)|Mar 8, 2026

Domain Adaptation and Few-Shot Learning | Citations: 0

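One-Line Summary

Megatron-Core MoE provides a full-stack, production-ready system that makes training trillion-parameter-scale MoE models scalable by jointly optimizing memory, communication, and computation through Parallel Folding, multi-dimensional parallelism, and low-precision techniques. It demonstrates state-of-the-art throughput for large-scale MoE models on NVIDIA GPUs.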

ABSTRACT

Scaling Mixture-of-Experts (MoE) training introduces systems challenges absent in dense models. Because each token activates only a subset of experts, this sparsity allows total parameters to grow much faster than per-token computation, creating coupled constraints across memory, communication, and computation. Optimizing one dimension often shifts pressure to another, demanding co-design across the full system stack. We address these challenges for MoE training through integrated optimizations spanning memory (fine-grained recomputation, offloading, etc.), communication (optimized dispatchers, overlapping, etc.), and computation (Grouped GEMM, fusions, CUDA Graphs, etc.). The framework also provides Parallel Folding for flexible multi-dimensional parallelism, low-precision training support for FP8 and NVFP4, and efficient long-context training. On NVIDIA GB300 and GB200, it achieves 1,233/1,048 TFLOPS/GPU for DeepSeek-V3-685B and 974/919 TFLOPS/GPU for Qwen3-235B. As a performant, scalable, and production-ready open-source solution, it has been used across academia and industry for training MoE models ranging from billions to trillions of parameters on clusters scaling up to thousands of GPUs. This report explains how these techniques work, their trade-offs, and their interactions at the systems level, providing practical guidance for scaling MoE models with Megatron Core.
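The abstract's point that sparsity lets total parameters grow much faster than per-token computation can be illustrated with a small parameter count. The sketch below is only an illustration: the helper `moe_ffn_params`, the two-matrix expert FFN, and the layer sizes are all assumptions chosen for the example, not values from the paper.

```python
def moe_ffn_params(d_model, d_ff, num_experts, top_k):
    """Return (total, active_per_token) parameter counts for one MoE FFN layer.

    Assumes each expert is a plain two-matrix FFN (up and down projection)
    with no biases, plus a linear router. These are illustrative assumptions.
    """
    per_expert = 2 * d_model * d_ff       # up projection + down projection
    router = d_model * num_experts        # one logit per expert
    total = num_experts * per_expert + router
    active = top_k * per_expert + router  # only top_k experts run per token
    return total, active


if __name__ == "__main__":
    # Hypothetical sizes: growing the expert count while keeping top_k fixed.
    for num_experts in (8, 64, 256):
        total, active = moe_ffn_params(
            d_model=4096, d_ff=2048, num_experts=num_experts, top_k=8
        )
        print(f"experts={num_experts:4d}  total={total:,}  active={active:,}")
```

With `top_k` fixed, the active count changes only through the small router term, while the total count scales almost linearly with the number of experts. That gap is what pushes pressure onto memory and communication rather than per-token compute.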

Motivation and Objectives

Motivate and define the scaling challenges of training large-scale Mixture-of-Experts (MoE) models.
Present Megatron-Core MoE architecture and how it integrates with the broader Megatron-Core stack.
Propose multi-dimensional parallelism and Parallel Folding to address the dense-sparse mismatch in MoE training.
Describe memory, communication, and compute optimizations to break the three walls of MoE training.
Provide empirical evaluation and best-practice workflow for scaling MoE models.

Proposed Method

Introduce MoE architecture with router, token dispatcher, and expert modules and a four-stage forward pass (Route, Dispatch, Compute, Combine).
Describe parallel group management and a ChainedOptimizer for expert parameter handling across data and expert parallelism groups.
Present Parallel Folding to decouple attention and MoE layer parallelism configurations for flexible mapping.
Detail memory optimizations including fine-grained recomputation, offloading, and reduced-precision FP8/FP4 training.
Explain communication optimizations with DeepEP, HybridEP, and dispatcher backends to hide EP latency.
Describe compute optimizations including Grouped GEMM, kernel fusion, and CUDA Graphs for dropless MoE.
Summarize production features like load balancing, distributed checkpointing, and upcycling from dense checkpoints.
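The four-stage forward pass named above (Route, Dispatch, Compute, Combine) can be sketched on a single device with NumPy. This is a minimal sketch under stated assumptions, not Megatron-Core code: the function `moe_forward`, the softmax-then-top-k router, the ReLU expert FFN, and all tensor shapes are hypothetical. The per-expert Python loop stands in for a Grouped GEMM, and the sort-by-expert permutation stands in for the all-to-all token dispatcher used under expert parallelism.

```python
import numpy as np


def moe_forward(x, w_router, w_up, w_down, top_k):
    """Single-device, dropless MoE layer sketch.

    x: [tokens, d] float array; w_router: [d, E];
    w_up: [E, d, f]; w_down: [E, f, d].
    Returns (output [tokens, d], tokens_per_expert [E]).
    """
    num_tokens = x.shape[0]
    num_experts = w_router.shape[1]

    # 1. Route: softmax over experts, keep the top-k, renormalize their weights.
    logits = x @ w_router
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    topk_idx = np.argsort(-probs, axis=1)[:, :top_k]
    topk_p = np.take_along_axis(probs, topk_idx, axis=1)
    topk_p /= topk_p.sum(axis=1, keepdims=True)

    # 2. Dispatch: sort (token, expert) pairs by expert so that each expert
    #    receives one contiguous block of tokens.
    flat_expert = topk_idx.reshape(-1)
    flat_token = np.repeat(np.arange(num_tokens), top_k)
    order = np.argsort(flat_expert, kind="stable")
    token_of_row = flat_token[order]
    dispatched = x[token_of_row]
    counts = np.bincount(flat_expert, minlength=num_experts)

    # 3. Compute: run each expert's FFN on its block (variable block sizes,
    #    no token dropping). A Grouped GEMM would batch these matmuls.
    expert_out = np.empty_like(dispatched)
    start = 0
    for e, n in enumerate(counts):
        block = dispatched[start:start + n]
        expert_out[start:start + n] = np.maximum(block @ w_up[e], 0.0) @ w_down[e]
        start += n

    # 4. Combine: scale by routing weight and scatter-add back to token order.
    weights = topk_p.reshape(-1)[order]
    y = np.zeros_like(x)
    np.add.at(y, token_of_row, expert_out * weights[:, None])
    return y, counts


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d, f, E, k = 6, 4, 8, 3, 2  # hypothetical toy sizes
    y, counts = moe_forward(
        rng.normal(size=(T, d)),
        rng.normal(size=(d, E)),
        rng.normal(size=(E, d, f)),
        rng.normal(size=(E, f, d)),
        top_k=k,
    )
    print(y.shape, counts)
```

The uneven `counts` array is the part that makes the real system hard: block sizes differ per expert and per step, which is why dispatch, grouped computation, and load balancing each get dedicated treatment.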


Experimental Results

Research Questions

RQ1: How can MoE training be scaled to trillion-parameter models without prohibitive memory, communication, or compute costs?
RQ2: What architecture and parallelism strategies (including EP and Parallel Folding) best mitigate the dense-sparse mismatch in MoE?
RQ3: What memory, communication, and compute optimizations yield the most throughput gains for large MoE models?
RQ4: How can reduced-precision training (FP8/FP4) be safely integrated across MoE components to maintain convergence?
RQ5: What practical guidelines optimize performance in production-scale MoE training on NVIDIA GPUs?

Key Findings

Achieves high throughput on large MoE models, reporting 1,233/1,048 TFLOPS per GPU for DeepSeek-V3-685B and 974/919 TFLOPS per GPU for Qwen3-235B on GB300/GB200 hardware.
Megatron-Core MoE enables training from billions to trillions of parameters on clusters with thousands of GPUs.
Demonstrates effectiveness of parallel folding and multi-dimensional parallelism to decouple attention and MoE layer mappings.
Shows memory reductions via fine-grained recomputation, activation offloading, and FP8/FP4 precision with selective stability strategies.
Demonstrates full CUDA Graphs coverage and dropless MoE techniques to reduce host overhead and synchronization.
Provides production features (load balancing, token dropping, distributed checkpointing, upcycling) enabling deployment at scale.
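For a rough sense of scale, the reported throughput figures can be turned into GB300-over-GB200 ratios. The snippet below only divides the numbers quoted in this review; the dictionary layout and the helper `gb300_over_gb200` are illustrative, and the ratios say nothing about where the difference comes from.

```python
# TFLOPS/GPU as quoted in this review, stored as (GB300, GB200).
reported = {
    "DeepSeek-V3-685B": (1233, 1048),
    "Qwen3-235B": (974, 919),
}


def gb300_over_gb200(tflops_pair):
    """Return the GB300 / GB200 throughput ratio for one model."""
    gb300, gb200 = tflops_pair
    return gb300 / gb200


if __name__ == "__main__":
    for model, pair in reported.items():
        print(f"{model}: {gb300_over_gb200(pair):.2f}x")
```

The ratios come out to roughly 1.18x for DeepSeek-V3-685B and 1.06x for Qwen3-235B.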

Figure 1: Data flow through an MoE layer: Route, Dispatch, Compute, and Combine stages.


This review was generated by AI and checked by a human editor.