QUICK REVIEW

[Paper Review] MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Ziheng Jiang, Haibin Lin | arXiv (Cornell University) | Feb 23, 2024

Topic Modeling | Citations: 24

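One-Line Summary

This paper presents MegaScale, a production system for training large language models on more than 10,000 GPUs. It relies on full-stack algorithm-system co-design and in-depth observability to achieve high efficiency and stability. Reported results include 55.2% MFU on 12,288 GPUs and a 1.34x MFU improvement over Megatron-LM.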

ABSTRACT

We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.

Motivation and Objectives

Enable training of LLMs at extreme scale (more than 10,000 GPUs) for higher model capability and throughput.
Address two core challenges at scale: training efficiency (MFU) and training stability (fault tolerance and reduced stragglers).
Present a full-stack co-design of algorithms and systems spanning model architecture, optimization, data pipeline, and networking to achieve scalable, stable training.
Share practical engineering experiences, diagnostics tools, and lessons learned to inform future LLM systems research.

Proposed Method

Co-design of algorithmic and system components across model blocks, optimizers, computation/communication overlap, data pipeline, and network tuning.
Use parallel transformer blocks, sliding window attention, and LAMB optimizer to scale and maintain accuracy.
Employ mixed parallelism (data, pipeline, tensor, sequence) with overlap strategies to hide communication costs.
Optimize data pipeline with prefetching and tree-based loading; fuse operations and overlap GEMM/communication.
Develop extensive observability and diagnosis tools, heartbeat-based fault detection, and fast checkpoint/recovery procedures.
Design the network topology and tune congestion control (reducing ECMP hashing conflicts, combining RTT- and ECN-based signals) to sustain high throughput at scale.
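
Two of the model-side techniques listed above, parallel transformer blocks and sliding window attention, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. It assumes the common parallel formulation y = x + Attention(LN(x)) + MLP(LN(x)) and a causal window of fixed width. The function names (`parallel_block`, `sliding_window_causal_mask`) and the toy callables are hypothetical, chosen only for illustration.

```python
def parallel_block(x, layer_norm, attention, mlp):
    """Parallel transformer block: y = x + attention(LN(x)) + mlp(LN(x)).

    Unlike the serial form, both branches read the same normalized input,
    so they do not depend on each other and can be computed concurrently.
    x is a list of floats; the three callables map a list to a list.
    """
    h = layer_norm(x)   # computed once, shared by both branches
    a = attention(h)    # attention branch
    m = mlp(h)          # MLP branch, independent of the attention output
    return [xi + ai + mi for xi, ai, mi in zip(x, a, m)]


def sliding_window_causal_mask(seq_len, window):
    """Return mask[i][j] == True when query i may attend to key j.

    Position i sees itself and at most (window - 1) preceding positions,
    so the number of allowed pairs grows as O(seq_len * window)
    instead of O(seq_len ** 2).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # Toy stand-ins for real layers (hypothetical, for illustration only).
    identity = lambda h: list(h)
    double = lambda h: [2.0 * v for v in h]
    plus_one = lambda h: [v + 1.0 for v in h]
    print(parallel_block([1.0, 2.0], identity, double, plus_one))

    # Print the mask as rows of 1/0 for a 6-token sequence, window of 3.
    for row in sliding_window_causal_mask(6, 3):
        print("".join("1" if allowed else "0" for allowed in row))
```

Because the two branches in `parallel_block` share one normalized input, a real system can run them side by side. The mask shows how a fixed window caps the attention cost per token regardless of sequence length.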

Experimental Results

Research Questions

RQ1: How can LLM training be scaled beyond 10,000 GPUs while maintaining high efficiency and stability?
RQ2: What algorithmic and system design choices (parallelism, optimizers, attention mechanisms) maximize MFU and convergence at large scale?
RQ3: What observability, fault-tolerance, and fast-recovery techniques are required to operate stable training jobs at extreme scale?
RQ4: What concrete engineering practices (data pipeline, network tuning, initialization) most effectively reduce downtime and stragglers?
RQ5: How does MegaScale perform in real production runs compared to existing open-source baselines like Megatron-LM?

Key Findings

MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B model on 12,288 GPUs.
This MFU represents a 1.34× improvement over Megatron-LM.
MegaScale trains a proprietary hundreds-of-billions-parameter model on multi-trillion tokens for weeks with ongoing convergence and recovery from over 100 faults.
The system demonstrates automatic fault identification, fast recovery, and minimal training interruption via a robust workflow and checkpointing optimizations.
A set of observability tools and diagnostics support fault localization, anomaly detection, and performance analysis in production-scale training.
The authors are open-sourcing components and sharing engineering learnings to inform the community.
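
To make the MFU figures above concrete, here is a rough Python sketch of how Model FLOPs Utilization is commonly computed. It rests on assumptions not stated in this review: the 6 * N FLOPs-per-token approximation for a dense model (which ignores attention FLOPs), and an illustrative per-GPU peak of 312e12 FLOP/s (roughly the dense BF16 peak of an NVIDIA A100; the review does not name the hardware). All function names are hypothetical.

```python
def model_flops_per_token(n_params):
    """Approximate training FLOPs per token for a dense model.

    Assumption: forward + backward costs about 6 FLOPs per parameter
    per token; attention FLOPs are ignored in this simplification.
    """
    return 6.0 * n_params


def mfu(tokens_per_second, n_params, n_gpus, peak_flops_per_gpu):
    """Model FLOPs Utilization: achieved model FLOP/s over aggregate peak FLOP/s."""
    achieved = model_flops_per_token(n_params) * tokens_per_second
    return achieved / (n_gpus * peak_flops_per_gpu)


def tokens_per_second_for(target_mfu, n_params, n_gpus, peak_flops_per_gpu):
    """Invert the MFU formula: throughput needed to reach a target MFU."""
    aggregate_peak = n_gpus * peak_flops_per_gpu
    return target_mfu * aggregate_peak / model_flops_per_token(n_params)


def implied_baseline_mfu(reported_mfu, improvement_ratio):
    """Baseline MFU implied by a reported MFU and its improvement ratio."""
    return reported_mfu / improvement_ratio


if __name__ == "__main__":
    n_params = 175e9   # 175B model, from the review
    n_gpus = 12_288    # from the review
    peak = 312e12      # ASSUMED per-GPU peak FLOP/s, illustrative only

    tps = tokens_per_second_for(0.552, n_params, n_gpus, peak)
    print(f"tokens/s needed for 55.2% MFU under these assumptions: {tps:,.0f}")
    print(f"round-trip MFU: {mfu(tps, n_params, n_gpus, peak):.3f}")
    print(f"baseline MFU implied by the 1.34x ratio: {implied_baseline_mfu(0.552, 1.34):.3f}")
```

The last line divides the reported 55.2% by the reported 1.34x ratio to show what baseline MFU those two numbers imply. It is derived arithmetic, not a separately reported measurement.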


This review was generated by AI and checked by a human editor.