QUICK REVIEW

[Paper Review] Scalable Training of Mixture-of-Experts Models with Megatron Core

Zijie Yan, Hongxiao Bai | arXiv (Cornell University) | 2026-03-08

Domain Adaptation and Few-Shot Learning | Citations: 0

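One-Line Summary

Megatron-Core MoE co-optimizes memory, communication, and computation, and uses Parallel Folding, multi-dimensional parallelism, and low-precision techniques to provide a full-stack, production-grade system for scalable training of trillion-parameter MoE models. It demonstrates state-of-the-art throughput for large MoE models on NVIDIA GPUs.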

ABSTRACT

Scaling Mixture-of-Experts (MoE) training introduces systems challenges absent in dense models. Because each token activates only a subset of experts, this sparsity allows total parameters to grow much faster than per-token computation, creating coupled constraints across memory, communication, and computation. Optimizing one dimension often shifts pressure to another, demanding co-design across the full system stack. We address these challenges for MoE training through integrated optimizations spanning memory (fine-grained recomputation, offloading, etc.), communication (optimized dispatchers, overlapping, etc.), and computation (Grouped GEMM, fusions, CUDA Graphs, etc.). The framework also provides Parallel Folding for flexible multi-dimensional parallelism, low-precision training support for FP8 and NVFP4, and efficient long-context training. On NVIDIA GB300 and GB200, it achieves 1,233/1,048 TFLOPS/GPU for DeepSeek-V3-685B and 974/919 TFLOPS/GPU for Qwen3-235B. As a performant, scalable, and production-ready open-source solution, it has been used across academia and industry for training MoE models ranging from billions to trillions of parameters on clusters scaling up to thousands of GPUs. This report explains how these techniques work, their trade-offs, and their interactions at the systems level, providing practical guidance for scaling MoE models with Megatron Core.
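The abstract's point that sparsity lets total parameters grow much faster than per-token computation can be illustrated with a small calculation. The sketch below is only an illustration: the function name `moe_param_counts`, its arguments, and the example sizes are hypothetical and are not taken from the paper or from Megatron Core.

```python
def moe_param_counts(num_experts, top_k, params_per_expert, shared_params):
    """Return (total, active-per-token) parameter counts for a toy MoE model.

    All arguments are hypothetical illustration values:
      num_experts       -- experts per MoE layer stack (treated as one pool)
      top_k             -- experts activated for each token
      params_per_expert -- parameters held by a single expert
      shared_params     -- dense parameters every token uses (attention, etc.)
    """
    # Memory must hold every expert, so total size scales with num_experts.
    total = shared_params + num_experts * params_per_expert
    # Each token only runs through top_k experts, so compute scales with top_k.
    active = shared_params + top_k * params_per_expert
    return total, active
```

With made-up sizes such as 64 experts, top-2 routing, 1B parameters per expert, and 2B shared parameters, this gives 66B total parameters but only 4B active per token. Doubling the expert count roughly doubles the memory footprint while leaving the per-token compute unchanged, which is why memory and communication become binding constraints before computation does.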

Research Motivation and Objectives

Motivate and define the scaling challenges of training large-scale Mixture-of-Experts (MoE) models.
Present Megatron-Core MoE architecture and how it integrates with the broader Megatron-Core stack.
Propose multi-dimensional parallelism and Parallel Folding to address the dense-sparse mismatch in MoE training.
Describe memory, communication, and compute optimizations to break the three walls of MoE training.
Provide empirical evaluation and best-practice workflow for scaling MoE models.

Proposed Method

Introduce MoE architecture with router, token dispatcher, and expert modules and a four-stage forward pass (Route, Dispatch, Compute, Combine).
Describe parallel group management and a ChainedOptimizer for expert parameter handling across data and expert parallelism groups.
Present Parallel Folding to decouple attention and MoE layer parallelism configurations for flexible mapping.
Detail memory optimizations including fine-grained recomputation, offloading, and reduced-precision FP8/FP4 training.
Explain communication optimizations with DeepEP, HybridEP, and dispatcher backends to hide EP latency.
Describe compute optimizations including Grouped GEMM, kernel fusion, and CUDA Graphs for dropless MoE.
Summarize production features like load balancing, distributed checkpointing, and upcycling from dense checkpoints.
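The four-stage forward pass listed above (Route, Dispatch, Compute, Combine) can be sketched on a single device in plain Python. This is a minimal illustration, not Megatron Core's implementation: `moe_forward`, `softmax`, the linear router, the top-k renormalization, and the representation of experts as Python callables are all assumptions made for the example. In a real system, the Dispatch and Combine stages are where tokens move between devices under expert parallelism.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def moe_forward(tokens, router_weights, experts, top_k=2):
    """Toy MoE layer (hypothetical helper, single device, no batching).

    tokens         -- list of feature vectors (lists of floats)
    router_weights -- one weight vector per expert (assumed linear router)
    experts        -- list of callables mapping a vector to a same-size vector
    """
    num_experts = len(experts)

    # 1) Route: score every token against every expert and keep the top-k.
    routes = []
    for tok in tokens:
        logits = [sum(w * x for w, x in zip(router_weights[e], tok))
                  for e in range(num_experts)]
        probs = softmax(logits)
        chosen = sorted(range(num_experts),
                        key=lambda e: probs[e], reverse=True)[:top_k]
        norm = sum(probs[e] for e in chosen)
        # Renormalize the gates over the chosen experts (an assumption here).
        routes.append([(e, probs[e] / norm) for e in chosen])

    # 2) Dispatch: group (token index, gate) pairs by destination expert.
    buckets = [[] for _ in range(num_experts)]
    for i, route in enumerate(routes):
        for e, gate in route:
            buckets[e].append((i, gate))

    # 3) Compute: each expert processes only the tokens in its bucket.
    # 4) Combine: gate-weighted outputs are summed back at the token position.
    out = [[0.0] * len(tok) for tok in tokens]
    for e, bucket in enumerate(buckets):
        for i, gate in bucket:
            y = experts[e](tokens[i])
            for d, v in enumerate(y):
                out[i][d] += gate * v
    return out, buckets
```

The per-expert bucket sizes returned alongside the output vary with the routing decisions. That variable-sized workload is what Grouped GEMM handles in the Compute stage, and uneven buckets are what load balancing tries to avoid.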


Experimental Results

Research Questions

RQ1: How can MoE training be scaled to trillion-parameter models without prohibitive memory, communication, or compute costs?
RQ2: What architecture and parallelism strategies (including EP and Parallel Folding) best mitigate the dense-sparse mismatch in MoE?
RQ3: What memory, communication, and compute optimizations yield the most throughput gains for large MoE models?
RQ4: How can reduced-precision training (FP8/FP4) be safely integrated across MoE components to maintain convergence?
RQ5: What practical guidelines optimize performance in production-scale MoE training on NVIDIA GPUs?

Key Results

Reports 1,233 TFLOPS/GPU on GB300 and 1,048 TFLOPS/GPU on GB200 for DeepSeek-V3-685B, and 974 TFLOPS/GPU on GB300 and 919 TFLOPS/GPU on GB200 for Qwen3-235B.
Megatron-Core MoE enables training from billions to trillions of parameters on clusters with thousands of GPUs.
Demonstrates effectiveness of parallel folding and multi-dimensional parallelism to decouple attention and MoE layer mappings.
Shows memory reductions via fine-grained recomputation, activation offloading, and FP8/FP4 precision with selective stability strategies.
Demonstrates full CUDA Graphs coverage and dropless MoE techniques to reduce host overhead and synchronization.
Provides production features (load balancing, token dropping, distributed checkpointing, upcycling) enabling deployment at scale.
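The difference between token dropping and dropless MoE mentioned in the results above can be made concrete with a capacity-factor sketch. The formula below is the commonly used capacity-factor formulation from the MoE literature. It is an assumption here, not a confirmed description of Megatron Core's exact policy, and the function names `expert_capacity` and `apply_capacity` are hypothetical.

```python
import math


def expert_capacity(num_tokens, num_experts, top_k, capacity_factor):
    """Maximum number of token copies each expert accepts (assumed formula)."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)


def apply_capacity(assignments, num_experts, capacity):
    """Split routed token copies into kept and dropped index lists.

    assignments -- expert id for each routed token copy, in arrival order.
    Copies that arrive after an expert is full are dropped (first come,
    first served; real systems may prioritize by router probability).
    """
    load = [0] * num_experts
    kept, dropped = [], []
    for idx, e in enumerate(assignments):
        if load[e] < capacity:
            load[e] += 1
            kept.append(idx)
        else:
            dropped.append(idx)
    return kept, dropped
```

Under this sketch, 8 tokens routed top-1 across 4 experts with a capacity factor of 1.0 give each expert room for 2 tokens, so a skewed assignment such as `[0, 0, 0, 1, 2, 3, 0, 1]` drops the third and fourth copies sent to expert 0. A dropless configuration corresponds to having no cap at all. It keeps every token, but per-expert workloads then become variable-sized, which makes static shapes for CUDA Graphs harder to obtain.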

Figure 1: Data flow through an MoE layer: Route, Dispatch, Compute, and Combine stages.


This review was generated by AI and reviewed by a human editor.