QUICK REVIEW

[Paper Digest] ZeRO++: Extremely Efficient Collective Communication for Giant Model Training

Guanhua Wang, Heyang Qin|arXiv (Cornell University)|Jun 16, 2023

Advanced Neural Network Applications | Cited by 15

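One-Sentence Summary

ZeRO++ introduces three communication-volume reduction techniques (qwZ, hpZ, and qgZ) that cut ZeRO's cross-node data movement by up to 4x, raising throughput for giant-model training.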

ABSTRACT

Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of large language models on massive GPUs clusters due to its ease of use, efficiency, and good scalability. However, when training on low-bandwidth clusters, or at scale which forces batch size per GPU to be small, ZeRO's effective throughput is limited because of high communication volume from gathering weights in forward pass, backward pass, and averaging gradients. This paper introduces three communication volume reduction techniques, which we collectively refer to as ZeRO++, targeting each of the communication collectives in ZeRO. First is block-quantization based all-gather. Second is data remapping that trades-off communication for more memory. Third is a novel all-to-all based quantized gradient averaging paradigm as replacement of reduce-scatter collective, which preserves accuracy despite communicating low precision data. Collectively, ZeRO++ reduces communication volume of ZeRO by 4x, enabling up to 2.16x better throughput at 384 GPU scale.

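Motivation and Goals

Enable efficient training of extremely large language models on multi-GPU clusters where inter-node bandwidth is limited.
Reduce ZeRO's communication overhead in its three collectives: the forward weight all-gather, the backward weight all-gather, and the gradient reduction.
Achieve higher throughput and scalability without restructuring training into 3D parallelism and without sacrificing convergence accuracy.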

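Proposed Method

Introduces three communication optimizations: qwZ (block-based INT8 weight quantization for the forward all-gather), hpZ (a secondary weight partition kept within each node, which eliminates the cross-node backward all-gather), and qgZ (an all-to-all based INT4 gradient reduction, combined with tensor slice reordering).
Uses block-based quantization to preserve accuracy, and implements high-performance CUDA kernels for quantization, dequantization, and operator fusion.
Uses a two-stage hierarchical all-to-all gradient reduction that communicates within each node first and across nodes second, minimizing cross-node traffic.
Overlaps computation with communication to hide latency, and fuses kernels to minimize memory traffic.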
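The two-stage reduction and the slice reordering described above can be illustrated with a small single-process simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: quantization and real communication are left out, GPUs are modeled as plain Python lists, and the function names (`flat_reduce_scatter`, `hierarchical_reduce_scatter`) are hypothetical. It only checks that an intra-node exchange followed by an inter-node exchange yields the same placement as a flat reduce-scatter once the slices are permuted beforehand.

```python
def chunk(vec, parts):
    """Split vec into `parts` equal contiguous chunks."""
    size = len(vec) // parts
    return [vec[i * size:(i + 1) * size] for i in range(parts)]


def vsum(vectors):
    """Element-wise sum of equal-length lists."""
    return [sum(vals) for vals in zip(*vectors)]


def flat_reduce_scatter(grads):
    """Reference result: rank r holds the sum of slice r over all ranks."""
    world = len(grads)
    return [vsum([chunk(g, world)[r] for g in grads]) for r in range(world)]


def hierarchical_reduce_scatter(grads, nodes, gpus_per_node):
    """Two-stage reduction: intra-node exchange first, inter-node second."""
    world = nodes * gpus_per_node

    # Slice reordering: the slice owned by rank (node n, local GPU j) is moved
    # to position j * nodes + n, so the two stages deliver it to the right rank.
    reordered = []
    for g in grads:
        slices = chunk(g, world)
        permuted = [slices[n * gpus_per_node + j]
                    for j in range(gpus_per_node) for n in range(nodes)]
        reordered.append([x for s in permuted for x in s])

    # Stage 1 (intra-node): local GPU j collects chunk j from its node peers
    # and sums them, leaving a node-level partial sum.
    partial = {}
    for n in range(nodes):
        local = [reordered[n * gpus_per_node + j] for j in range(gpus_per_node)]
        for j in range(gpus_per_node):
            partial[(n, j)] = vsum([chunk(v, gpus_per_node)[j] for v in local])

    # Stage 2 (inter-node): GPUs with the same local index j exchange
    # sub-chunks across nodes and sum them.
    result = [None] * world
    for j in range(gpus_per_node):
        for n in range(nodes):
            result[n * gpus_per_node + j] = vsum(
                [chunk(partial[(m, j)], nodes)[n] for m in range(nodes)])
    return result


# Toy setup (assumed): 2 nodes x 2 GPUs, 8 gradient elements per GPU.
grads = [[rank * 10 + i for i in range(8)] for rank in range(4)]
assert hierarchical_reduce_scatter(grads, 2, 2) == flat_reduce_scatter(grads)
```

In this sketch only the stage 2 exchange crosses node boundaries, and by then each GPU holds just 1/`gpus_per_node` of the gradient, which is the intuition behind running the intra-node stage first.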

Figure 1. Large-scale training throughput is constrained by network bandwidth and batch size per GPU

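Experimental Results

Research Questions

RQ1: How can ZeRO's inter-device communication be reduced without sacrificing training accuracy for extremely large models?
RQ2: Which combination of quantization and partitioning techniques minimizes both the forward/backward weight communication and the gradient aggregation in ZeRO-3?
RQ3: Can a new all-to-all based gradient reduction paradigm that communicates low-precision data reduce communication volume while preserving convergence?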

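Key Findings

ZeRO++ reduces cross-node communication volume per training iteration from 3M to 0.75M, where M is the model size, which raises throughput.
On models ranging from 10B to 138B parameters, throughput improves by up to 2.4x over the ZeRO baseline.
Demonstrates scaling of GPT-3-like models to as many as 384 GPUs, reaching over 45% of sustained peak throughput.
Maintains model convergence and training accuracy comparable to the ZeRO baseline.
In low-bandwidth settings, ZeRO++ reaches throughput similar to that of the baseline in a high-bandwidth configuration.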
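The drop from 3M to 0.75M can be reproduced with simple arithmetic. The sketch below assumes an FP16 (16-bit) baseline in which each of the three collectives moves M across nodes, and then applies the three techniques listed in the method section: INT8 weights for the forward all-gather, no cross-node traffic for the backward all-gather, and INT4 gradients. The function name `cross_node_volume` and the constant `BASELINE_BITS` are illustrative, not taken from the paper.

```python
from fractions import Fraction

BASELINE_BITS = 16  # assumption: the ZeRO baseline communicates FP16 values


def cross_node_volume(fwd_bits, bwd_bits, grad_bits):
    """Per-iteration cross-node volume in units of M (model size).

    Each argument is the number of bits per element that one collective
    sends across nodes; 0 means the collective stays inside the node.
    """
    parts = {
        "forward all-gather": Fraction(fwd_bits, BASELINE_BITS),
        "backward all-gather": Fraction(bwd_bits, BASELINE_BITS),
        "gradient reduction": Fraction(grad_bits, BASELINE_BITS),
    }
    return parts, sum(parts.values())


baseline_parts, baseline_total = cross_node_volume(16, 16, 16)
zpp_parts, zpp_total = cross_node_volume(8, 0, 4)  # qwZ, hpZ, qgZ

print(f"baseline: {float(baseline_total)}M")
print(f"ZeRO++:   {float(zpp_total)}M")
print(f"reduction factor: {float(baseline_total / zpp_total)}x")
```

Under these assumptions the per-collective contributions are 0.5M, 0, and 0.25M, which sum to the 0.75M figure and give the 4x reduction quoted in the abstract.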

Figure 2. Illustration & example of block based quantization vs. baseline
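To make the contrast in Figure 2 concrete, the following is a minimal pure-Python sketch of symmetric INT8 quantization with a single scale for the whole tensor versus one scale per block. It is an illustration under assumptions rather than the paper's CUDA kernels: the symmetric scaling scheme, the block size of 4, the toy values, and the helper names (`quantize`, `block_roundtrip`, `mean_abs_error`) are all chosen for the example.

```python
def quantize(values, bits=8):
    """Symmetric quantization: map values to integer codes with one scale."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for INT8
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate values from integer codes."""
    return [c * scale for c in codes]


def block_roundtrip(values, block_size, bits=8):
    """Quantize and dequantize each block with its own scale."""
    out = []
    for start in range(0, len(values), block_size):
        codes, scale = quantize(values[start:start + block_size], bits)
        out.extend(dequantize(codes, scale))
    return out


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Toy tensor (assumed): one block of small values, one block of large values.
weights = [0.01, -0.02, 0.03, 0.015, 10.0, -8.0, 6.0, 9.5]

whole_err = mean_abs_error(weights, dequantize(*quantize(weights)))
block_err = mean_abs_error(weights, block_roundtrip(weights, block_size=4))

print(f"whole-tensor scale error: {whole_err:.6f}")
print(f"per-block scale error:    {block_err:.6f}")
assert block_err < whole_err
```

With a single scale, the large values dictate the step size and the small values all round to zero; per-block scales let the first block use a much finer step, which is why the block-based error is lower on this input.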

This digest was generated by AI and reviewed by human editors.