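[Paper Summary] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Megatron-LM demonstrates intra-layer model parallelism for training transformer models with billions of parameters in PyTorch. It reaches 8.3B parameters on 512 GPUs with high scaling efficiency and achieves state-of-the-art results on several NLP benchmarks.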
Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging transformer based models up to 8.3 billion parameters using 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs. To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT. We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows. Using the GPT-2 model we achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy of 89.4%).
Motivation and Goals
- Motivate training of multi-billion parameter language models beyond single-GPU memory limits.
- Develop a simple, efficient intra-layer model parallel approach that fits in PyTorch with minimal changes.
- Evaluate scaling efficiency and performance on GPT-2 and BERT-like architectures.
- Demonstrate state-of-the-art results on language modeling and downstream tasks while releasing open-source code.
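Proposed Method
- Implement intra-layer model parallelism by partitioning the GEMMs in the MLP and self-attention blocks across GPUs, which keeps synchronization overhead to a minimum.
- Each transformer layer needs only two all-reduce operations in the forward pass and two in the backward pass, which keeps scaling efficient.
- Parallelize the input and output embedding matrices along the vocabulary dimension to reduce cross-GPU communication.
- Duplicate the layer normalization and residual computations on every GPU to avoid extra communication.
- Train the transformer models in mixed precision, combining dynamic loss scaling with activation checkpointing for memory efficiency.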
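To make the GEMM partitioning above concrete, here is a minimal single-process sketch of how an MLP block can be split so that only one summation across GPUs is needed in the forward pass. It assumes the first weight matrix is sharded by columns and the second by rows, so the GeLU can be applied locally on each shard. All function names are hypothetical, the "GPUs" are simulated as list entries, and `all_reduce_sum` is only a stand-in for a real collective; this is an illustration, not the released Megatron-LM code.

```python
import math
import random


def gelu(x):
    # Exact GeLU using the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matmul(a, b):
    # Plain list-of-lists matrix product: (n x k) @ (k x m) -> (n x m).
    return [[sum(row[t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for row in a]


def apply_gelu(m):
    return [[gelu(v) for v in row] for row in m]


def split_columns(m, parts):
    # Shard a matrix into equal column blocks, one per simulated GPU.
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]


def split_rows(m, parts):
    # Shard a matrix into equal row blocks, one per simulated GPU.
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]


def all_reduce_sum(partials):
    # Stand-in for an all-reduce: element-wise sum of every rank's partial output.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def mlp_reference(x, a, b):
    # Unsharded MLP: Z = GeLU(X A) B.
    return matmul(apply_gelu(matmul(x, a)), b)


def mlp_model_parallel(x, a, b, world_size):
    # Assumption: A is split by columns and B by rows, so each rank computes
    # GeLU(X A_i) B_i locally and a single sum combines the results.
    a_shards = split_columns(a, world_size)
    b_shards = split_rows(b, world_size)
    partials = []
    for a_i, b_i in zip(a_shards, b_shards):
        y_i = apply_gelu(matmul(x, a_i))  # local work, no communication
        partials.append(matmul(y_i, b_i))
    return all_reduce_sum(partials)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
x = random_matrix(3, 4, rng)   # 3 tokens, hidden size 4
a = random_matrix(4, 8, rng)   # hidden -> 2x hidden (toy sizes)
b = random_matrix(8, 4, rng)   # 2x hidden -> hidden
z_ref = mlp_reference(x, a, b)
z_par = mlp_model_parallel(x, a, b, world_size=2)
max_diff = max(abs(r - p) for rr, pr in zip(z_ref, z_par) for r, p in zip(rr, pr))
print(max_diff < 1e-9)
```

The sharded result matches the unsharded one up to floating-point rounding, because GeLU acts element-wise and a column block of the first GEMM lines up with the matching row block of the second. Applying the same idea to the self-attention block gives the second forward all-reduce per layer.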
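Experimental Results
Research Questions
- RQ1: Can intra-layer model parallelism in PyTorch scale transformer models to multiple billions of parameters without a custom compiler?
- RQ2: As parameter counts grow into the billions, how does model size affect the performance of GPT-2 and BERT-like models on standard NLP benchmarks?
- RQ3: Which architectural adjustments (such as the placement of layer normalization) are needed to maintain or improve performance as the model grows?
- RQ4: What are the practical scalability limits (FLOPs, throughput, efficiency) when training multi-billion parameter models on hundreds of GPUs?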
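Key Findings
- Training models of up to 8.3B parameters on 512 GPUs with 8-way model parallelism sustains up to 15.1 PetaFLOPs, a 76% scaling efficiency relative to a strong single-GPU baseline (39 TeraFLOPs).
- Weak scaling holds up in both the model-parallel-only and the model+data-parallel configurations, with the largest settings reaching 74% to 77% scaling efficiency.
- In BERT-like models, careful placement of layer normalization makes performance improve monotonically as model size increases.
- The GPT-2 style model reaches state-of-the-art perplexity on WikiText103 (10.81) and a strong LAMBADA accuracy (66.51%) at 8.3B parameters; RACE accuracy also rises to 90.9% with the larger BERT-style Megatron configurations.
- With the proposed architecture and training recipe, the BERT-style models achieve state-of-the-art development-set results on several GLUE-style tasks and on RACE.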
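The 76% figure in the findings above can be checked with a few lines of arithmetic. The sketch below assumes scaling efficiency means sustained throughput divided by ideal linear scaling of the single-GPU baseline; that definition and the function name `scaling_efficiency` are assumptions for illustration, while the three input numbers come from the summary itself.

```python
def scaling_efficiency(sustained_pflops, num_gpus, baseline_tflops_per_gpu):
    # Assumed definition: measured throughput over perfect linear scaling
    # of the single-GPU baseline (1 PetaFLOPs = 1000 TeraFLOPs).
    ideal_pflops = num_gpus * baseline_tflops_per_gpu / 1000.0
    return sustained_pflops / ideal_pflops


# Numbers reported above: 15.1 PetaFLOPs on 512 GPUs, 39 TeraFLOPs baseline.
efficiency = scaling_efficiency(15.1, 512, 39.0)
print(f"{efficiency:.1%}")  # prints 75.6%, which rounds to the reported 76%
```

Under this definition, perfect scaling would give 512 x 39 TeraFLOPs, or about 19.97 PetaFLOPs, so the sustained 15.1 PetaFLOPs is consistent with the reported efficiency.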
This summary was generated by AI and reviewed by human editors.