QUICK REVIEW

[Paper Review] Maximizing Parallelism in Distributed Training for Huge Neural Networks

Zhengda Bian, Qifan Xu|arXiv (Cornell University)|May 30, 2021

Advanced Neural Network Applications | References: 14 | Cited by: 20

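One-Sentence Summary

This paper proposes a novel 3-D model parallelism technique for training huge neural networks, in particular Transformer language models. It distributes the computation of linear layers across the tensor, sequence, and head dimensions to achieve perfect load balance. The approach lowers memory and communication overhead and, on 64 V100 GPUs, achieves 2.32x and 1.57x speedups over 1-D and 2-D parallelism, respectively.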

ABSTRACT

The recent Natural Language Processing techniques have been refreshing the state-of-the-art performance at an incredible speed. Training huge language models is therefore an imperative demand in both industry and academy. However, huge language models impose challenges to both hardware and software. Graphical processing units (GPUs) are iterated frequently to meet the exploding demand, and a variety of ASICs like TPUs are spawned. However, there is still a tension between the fast growth of the extremely huge models and the fact that Moore's law is approaching the end. To this end, many model parallelism techniques are proposed to distribute the model parameters to multiple devices, so as to alleviate the tension on both memory and computation. Our work is the first to introduce a 3-dimensional model parallelism for expediting huge language models. By reaching a perfect load balance, our approach presents smaller memory and communication cost than existing state-of-the-art 1-D and 2-D model parallelism. Our experiments on 64 TACC's V100 GPUs show that our 3-D parallelism outperforms the 1-D and 2-D parallelism with 2.32x and 1.57x speedup, respectively.

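Motivation and Goals

Address the growing challenge of training huge language models, which now exceed the memory and compute capacity of a single device.
Reduce communication and memory overhead in distributed model parallelism by introducing a 3-D decomposition of linear layers.
Achieve perfect load balance across GPUs in distributed training of huge Transformer models.
Outperform existing 1-D and 2-D model parallelism techniques in both weak-scaling and strong-scaling scenarios.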

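Proposed Method

Proposes a 3-D model parallelism algorithm that decomposes linear layers along the tensor, sequence, and head dimensions to balance computation and communication load.
Adopts a load-balanced data layout that minimizes idle time and raises GPU utilization on every device.
Applies a 3-D blocking strategy that partitions weight matrices and activations over a 3-D GPU grid, reducing communication volume.
Integrates with PyTorch's distributed communication backend and supports mixed-precision training to reduce memory footprint.
Uses a custom communication schedule that overlaps computation with communication to hide latency.
Implements a 3-D parallel Transformer architecture by extending an existing PyTorch implementation with 3-D tensor parallelism.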
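
The 3-D blocking idea above can be illustrated with a small single-process simulation. This is only a minimal sketch of the generic 3-D matrix multiplication scheme on a hypothetical q x q x q device grid, not the paper's actual data layout or its PyTorch implementation. The function names (`matmul`, `block`, `matmul_3d`) are made up for illustration, the matrix dimensions are assumed to be divisible by q, and the split along tensor, sequence, and head dimensions is not modeled. Device (i, j, l) multiplies block A[i][l] by block B[l][j], and the partial products are summed along the l axis, which stands in for a reduce collective.

```python
import random


def matmul(a, b):
    """Plain dense product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def block(m, r, c, q):
    """Return block (r, c) of matrix m split into a q x q grid of equal blocks."""
    bh, bw = len(m) // q, len(m[0]) // q
    return [row[c * bw:(c + 1) * bw] for row in m[r * bh:(r + 1) * bh]]


def matmul_3d(a, b, q):
    """Simulate C = A @ B on a hypothetical q x q x q device grid.

    Assumes every matrix dimension is divisible by q.
    Returns the product and the multiply-add count per simulated device.
    """
    n_rows, n_cols = len(a), len(b[0])
    bh, bw = n_rows // q, n_cols // q
    c = [[0] * n_cols for _ in range(n_rows)]
    work = {}
    for i in range(q):
        for j in range(q):
            for l in range(q):
                a_blk = block(a, i, l, q)   # rows i, inner slice l
                b_blk = block(b, l, j, q)   # inner slice l, columns j
                partial = matmul(a_blk, b_blk)
                work[(i, j, l)] = len(a_blk) * len(a_blk[0]) * len(b_blk[0])
                # Sum partial results along l (stands in for a reduce collective).
                for r in range(bh):
                    for s in range(bw):
                        c[i * bh + r][j * bw + s] += partial[r][s]
    return c, work


random.seed(0)
A = [[random.randint(-3, 3) for _ in range(6)] for _ in range(4)]   # 4 x 6
B = [[random.randint(-3, 3) for _ in range(8)] for _ in range(6)]   # 6 x 8
C, work = matmul_3d(A, B, q=2)
print("matches dense product:", C == matmul(A, B))
print("distinct per-device workloads:", set(work.values()))
```

Every simulated device performs the same number of multiply-adds, which is the sense in which a 3-D decomposition is load balanced. A real implementation would replace the inner accumulation with collective communication across devices.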

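Experimental Results

Research Questions

RQ1: Does the 3-D model parallelism design achieve better load balance than 1-D or 2-D approaches in distributed training of large Transformers?
RQ2: Does 3-D parallelism reduce communication and memory overhead compared with existing 2-D and 1-D approaches?
RQ3: How well does the 3-D approach scale on large GPU clusters in weak-scaling and strong-scaling scenarios?
RQ4: Can 3-D parallelism deliver higher training throughput while preserving model accuracy?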

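Key Findings

On 64 V100 GPUs, 3-D parallelism achieves a 2.32x speedup over 1-D model parallelism under strong scaling.
On the same hardware configuration, the 3-D approach achieves a 1.57x speedup over 2-D model parallelism.
Under weak scaling, the average step time of the 3-D approach grows the slowest, indicating better scalability and lower communication overhead.
At every GPU count tested, 3-D parallelism has the lowest average step time, which supports its load-balance advantage.
The approach lowers memory and communication overhead by distributing computation and data more evenly across GPUs.
The 3-D approach beats the 1-D and 2-D approaches in both forward-pass and backward-pass time, with the largest gains at scale.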
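
For readers unfamiliar with the metrics behind these findings, the following minimal sketch shows how speedup and scaling efficiency are commonly derived from average step times. The function names are hypothetical, and the step times in the usage lines are placeholder values chosen for easy arithmetic, not measurements from the paper.

```python
def speedup(baseline_step_time, candidate_step_time):
    """How many times faster the candidate is than the baseline."""
    return baseline_step_time / candidate_step_time


def weak_scaling_efficiency(base_step_time, scaled_step_time):
    """Work per device is fixed, so the ideal step time stays constant."""
    return base_step_time / scaled_step_time


def strong_scaling_efficiency(base_step_time, base_devices,
                              scaled_step_time, scaled_devices):
    """Total work is fixed, so the ideal step time falls as 1 / devices."""
    return (base_step_time * base_devices) / (scaled_step_time * scaled_devices)


# Placeholder step times in seconds (NOT results from the paper).
print(speedup(4.0, 2.0))
print(weak_scaling_efficiency(2.0, 2.5))
print(strong_scaling_efficiency(8.0, 8, 2.0, 64))
```

An efficiency of 1.0 means ideal scaling. A step time that grows slowly under weak scaling, as reported for the 3-D approach, corresponds to a weak-scaling efficiency that stays close to 1.0.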

This review was generated by AI and reviewed by a human editor.