QUICK REVIEW

[Paper Review] Efficient Large-Scale Language Model Training on GPU Clusters

Deepak Narayanan, Mohammad Shoeybi|arXiv (Cornell University)|Apr 9, 2021

Topic Modeling | References: 25 | Cited by: 31

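One-Sentence Summary

This paper proposes a hybrid parallelism framework that combines tensor, pipeline, and data parallelism to train trillion-parameter language models efficiently on clusters of up to 3072 GPUs. By introducing an interleaved pipeline schedule and optimizing communication and memory use, the approach improves throughput by more than 10% over previously proposed pipeline schedules and reaches 52% of peak per-GPU throughput (previous efforts at similar model sizes reached 36%), which amounts to 502 petaFLOP/s in aggregate on 3072 GPUs.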

ABSTRACT

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these large models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on a single GPU or even on a multi-GPU server; and b) the number of compute operations required to train these models can result in unrealistically long training times. New methods of model parallelism such as tensor and pipeline parallelism have been proposed to address these challenges. Unfortunately, naive usage leads to fundamental scaling issues at thousands of GPUs due to various reasons, e.g., expensive cross-node communication or idle periods waiting on other devices. In this work, we show how to compose different types of parallelism methods (tensor, pipeline, and data parallelism) to scale to thousands of GPUs, achieving a two-order-of-magnitude increase in the sizes of models we can efficiently train compared to existing systems. We survey techniques for pipeline parallelism and propose a novel interleaved pipeline parallelism schedule that can improve throughput by more than 10% with comparable memory footprint compared to previously-proposed approaches. We quantitatively study the trade-offs between tensor, pipeline, and data parallelism, and provide intuition as to how to configure distributed training of a large model. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs with achieved per-GPU throughput of 52% of peak; previous efforts to train similar-sized models achieve much lower throughput (36% of theoretical peak). Our code is open sourced at this https URL.

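Motivation and Goals

Address the two obstacles to training trillion-parameter language models on GPU clusters: limited GPU memory and impractically long training times.
Overcome the scaling limits of existing parallelism methods at thousands of GPUs, in particular those caused by expensive communication and by devices sitting idle.
Build a scalable, efficient training system that combines tensor, pipeline, and data parallelism to get the best performance and memory use.
Raise throughput and cut idle time in large-model training with a novel interleaved pipeline-parallel schedule.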

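Proposed Method

Integrates tensor, pipeline, and data parallelism into a single training framework that distributes model parameters and activations across thousands of GPUs.
Introduces an interleaved pipeline-parallel schedule that overlaps forward and backward passes across pipeline stages, reducing idle time and raising throughput.
Optimizes communication patterns both between nodes and within the cluster to minimize latency and maximize bandwidth utilization.
Partitions model layers and data batches across devices, together with mixed-precision training and gradient checkpointing, to balance memory and compute load on each device.
Uses a hybrid model-parallel strategy that chooses the degrees of tensor and pipeline parallelism according to model size and hardware constraints.
Implements a system-level training loop that monitors device utilization in real time and adjusts the schedule so that every node sustains high GPU throughput.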
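The effect of the interleaved schedule on idle time can be made concrete with a small calculation. The sketch below is illustrative rather than taken from this review: it assumes the commonly used idealized model in which every microbatch takes the same time and communication is free, so that the idle ("bubble") time relative to ideal compute time is (p - 1) / m for a non-interleaved schedule and (p - 1) / (v * m) when each device holds v model chunks. The symbols p, m, and v and the function name are assumptions introduced here.

```python
from fractions import Fraction


def bubble_fraction(p: int, m: int, v: int = 1) -> Fraction:
    """Pipeline idle time as a fraction of ideal compute time.

    Idealized model (assumption): uniform microbatch time, no
    communication cost.

    p: number of pipeline stages (devices in one pipeline)
    m: number of microbatches per batch
    v: model chunks per device; v = 1 means a non-interleaved schedule
    """
    if p < 1 or m < 1 or v < 1:
        raise ValueError("p, m and v must be positive integers")
    return Fraction(p - 1, v * m)


# Hypothetical configuration: 8 stages, 32 microbatches.
plain = bubble_fraction(8, 32)              # non-interleaved
interleaved = bubble_fraction(8, 32, v=2)   # two chunks per device
```

Under this model, doubling v halves the bubble. The model ignores the extra communication that interleaving introduces, so the real gain depends on the interconnect.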

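Experimental Results

Research Questions

RQ1: How can tensor, pipeline, and data parallelism be combined effectively to scale language model training to 3072 GPUs?
RQ2: Which scheduling strategy minimizes idle time and maximizes throughput in large-scale pipeline-parallel training?
RQ3: How does the proposed interleaved pipeline schedule compare with earlier approaches in memory use and throughput?
RQ4: What are the trade-offs among the parallelism strategies in communication overhead, memory footprint, and compute efficiency?
RQ5: How close can the system get to high GPU utilization and theoretical peak performance when training a trillion-parameter model?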
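The configuration space behind RQ1 and RQ4 can be sketched as follows. When the three forms of parallelism are composed, the tensor-parallel size t, pipeline-parallel size p, and data-parallel size d must multiply to the total GPU count. The function name and the cap of 8 on t (standing in for "tensor parallelism stays inside one server") are assumptions made for illustration, not values stated in this review.

```python
def parallel_configs(num_gpus: int, max_tensor: int = 8) -> list[tuple[int, int, int]]:
    """List every (t, p, d) with t * p * d == num_gpus and t <= max_tensor.

    t: tensor-parallel size, p: pipeline-parallel size, d: data-parallel size.
    max_tensor is a hypothetical limit reflecting GPUs per server.
    """
    configs = []
    for t in range(1, max_tensor + 1):
        if num_gpus % t != 0:
            continue
        rest = num_gpus // t
        for p in range(1, rest + 1):
            if rest % p == 0:
                configs.append((t, p, rest // p))
    return configs


configs = parallel_configs(3072)
# (8, 64, 6) is one valid factorization of 3072 GPUs.
example = (8, 64, 6)
```

Each triple trades off differently: a larger t adds all-reduce traffic inside a layer, a larger p adds pipeline idle time, and a larger d adds gradient synchronization across replicas.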

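Key Findings

The proposed hybrid parallelism framework runs training iterations on a 1-trillion-parameter language model at 502 petaFLOP/s on 3072 GPUs.
The system achieves per-GPU throughput of 52% of peak, compared with 36% of theoretical peak for previous efforts to train models of similar size.
Compared with non-interleaved schedules, the interleaved pipeline-parallel schedule reduces idle time and improves throughput by more than 10% with a comparable memory footprint.
The framework scales efficiently to thousands of GPUs, overcoming fundamental scaling problems such as cross-node communication bottlenecks and device synchronization delays.
Quantitative analysis shows that combining tensor, pipeline, and data parallelism gives a better trade-off among memory, communication, and compute than any single method alone.
The open-source implementation supports reproducibility and further research on large-model training on general-purpose GPU clusters.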
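The headline numbers above can be cross-checked with simple arithmetic. The sketch below divides the aggregate rate by the GPU count to get the achieved per-GPU rate, then divides by the reported fraction of peak to get the peak rate those figures imply. The helper names are introduced here for illustration, and because 52% is a rounded figure the implied peak is only approximate.

```python
def per_gpu_teraflops(aggregate_petaflops: float, num_gpus: int) -> float:
    """Convert an aggregate petaFLOP/s rate into teraFLOP/s per GPU."""
    return aggregate_petaflops * 1000.0 / num_gpus


def implied_peak_teraflops(achieved_teraflops: float, fraction_of_peak: float) -> float:
    """Peak per-GPU rate implied by an achieved rate and a utilization fraction."""
    return achieved_teraflops / fraction_of_peak


achieved = per_gpu_teraflops(502, 3072)          # about 163 teraFLOP/s per GPU
peak = implied_peak_teraflops(achieved, 0.52)    # a little over 310 teraFLOP/s
```

The same arithmetic applied to the 36% figure for earlier systems would need their aggregate rate and GPU count, which this review does not give.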


This review was generated by AI and checked by a human editor.