QUICK REVIEW

[Paper Review] ZeRO-Offload: Democratizing Billion-Scale Model Training

Jie Ren, Samyam Rajbhandari|arXiv (Cornell University)|Jan 18, 2021

Advanced Neural Network Applications | References: 27 | Cited by: 61

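One-Sentence Summary

ZeRO-Offload moves gradients, optimizer states, and the optimizer computation to the CPU, making it possible to train models with up to 13B parameters on a single GPU while keeping performance scalable and integrating seamlessly with PyTorch.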

ABSTRACT

Large-scale model training has been a playing ground for a limited few requiring complex model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload changes the large model training landscape by making large model training accessible to nearly everyone. It can train models with over 13 billion parameters on a single GPU, a 10x increase in size compared to popular framework such as PyTorch, and it does so without requiring any model change from the data scientists or sacrificing computational efficiency. ZeRO-Offload enables large model training by offloading data and compute to CPU. To preserve compute efficiency, it is designed to minimize the data movement to/from GPU, and reduce CPU compute time while maximizing memory savings on GPU. As a result, ZeRO-Offload can achieve 40 TFlops/GPU on a single NVIDIA V100 GPU for 10B parameter model compared to 30TF using PyTorch alone for a 1.4B parameter model, the largest that can be trained without running out of memory. ZeRO-Offload is also designed to scale on multiple-GPUs when available, offering near linear speedup on up to 128 GPUs. Additionally, it can work together with model parallelism to train models with over 70 billion parameters on a single DGX-2 box, a 4.5x increase in model size compared to using model parallelism alone. By combining compute and memory efficiency with ease-of-use, ZeRO-Offload democratizes large-scale model training making it accessible to even data scientists with access to just a single GPU.

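Motivation and Goals

Make billion-parameter model training broadly accessible by lowering the hardware barrier.
Propose a unique offload strategy that maximizes GPU memory savings while minimizing CPU computation and CPU-GPU communication.
Demonstrate scalability from a single GPU to as many as 128 GPUs, using ZeRO-powered data parallelism and, optionally, model parallelism.
Provide an optimized CPU execution path (a fast Adam) and a scheduling technique that together preserve throughput and accuracy.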

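Proposed Method

Model training is represented as a dataflow graph, which is partitioned between CPU and GPU to balance computation, communication, and memory savings.
The offload strategy keeps the FP16 parameters on the GPU, offloads the FP16 gradients and the FP32 optimizer states to the CPU, and performs the parameter update on the CPU.
An analysis of the possible partitions shows that this choice minimizes both CPU computation and CPU-GPU communication, which makes it the best offload decision under those criteria.
Offloading is combined with ZeRO stage 2 data parallelism to reach near-linear scaling on up to 128 GPUs, and it remains compatible with model parallelism for even larger models.
The parameter update runs on an optimized CPU Adam implementation that exploits SIMD, loop unrolling, and multithreading.
An optional delayed parameter update (DPU) postpones the update by one step so that the CPU update overlaps with GPU computation, raising throughput without loss of accuracy.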
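
The data flow described above can be illustrated with a small toy simulation. This is a minimal sketch under loud assumptions, not the authors' implementation: there is no real GPU, no SIMD kernel, and no concurrency. Plain Python lists stand in for device and host memory, FP16 is emulated by rounding through the `struct` half-precision format, and the names `to_fp16`, `CpuAdamState`, `gpu_forward_backward`, and `train` are hypothetical. The `delayed` flag mimics the one-step staleness of DPU sequentially: the gradients from one step are applied only after the next step's forward and backward pass has already used the older parameters.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


class CpuAdamState:
    """FP32-style master weights and Adam moments; stands in for CPU memory."""

    def __init__(self, params, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
        self.master = list(params)      # master copy of the parameters
        self.m = [0.0] * len(params)    # first moment (momentum)
        self.v = [0.0] * len(params)    # second moment (variance)
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.t = 0

    def step(self, grads_fp16):
        """Run one Adam update on the 'CPU' and return fresh FP16 parameters."""
        self.t += 1
        bc1 = 1.0 - self.beta1 ** self.t    # bias correction terms
        bc2 = 1.0 - self.beta2 ** self.t
        for i, g in enumerate(grads_fp16):
            self.m[i] = self.beta1 * self.m[i] + (1.0 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1.0 - self.beta2) * g * g
            m_hat = self.m[i] / bc1
            v_hat = self.v[i] / bc2
            self.master[i] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)
        # Only the FP16 parameters travel back to the 'GPU'.
        return [to_fp16(p) for p in self.master]


def gpu_forward_backward(params_fp16, target):
    """Toy 'GPU' work: loss = sum((p - t)^2); returns the loss and FP16 grads."""
    loss = sum((p - t) ** 2 for p, t in zip(params_fp16, target))
    grads = [to_fp16(2.0 * (p - t)) for p, t in zip(params_fp16, target)]
    return loss, grads


def train(steps=200, delayed=False):
    target = [1.0, -2.0, 0.5]
    cpu = CpuAdamState([0.0, 0.0, 0.0])
    gpu_params = [to_fp16(p) for p in cpu.master]  # FP16 copy kept on the 'GPU'
    pending = None                                 # grads awaiting a delayed update
    losses = []
    for _ in range(steps):
        # GPU side: forward and backward, then FP16 grads go to the CPU.
        loss, grads = gpu_forward_backward(gpu_params, target)
        losses.append(loss)
        if not delayed:
            # Synchronous offload: the update finishes before the next step starts.
            gpu_params = cpu.step(grads)
        else:
            # DPU-style: apply the previous step's grads; in a real system this
            # update would run concurrently with the forward/backward above.
            if pending is not None:
                gpu_params = cpu.step(pending)
            pending = grads
    return losses, gpu_params
```

Calling `train()` and `train(delayed=True)` on this toy quadratic objective gives a feel for why a one-step delay can be tolerated: both variants start from the same loss and both drive it toward zero, with the delayed variant simply working from parameters that are one update behind. A real system would additionally need pinned host buffers, asynchronous copies, and a vectorized multithreaded Adam kernel, none of which are modeled here.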

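Experimental Results

Research Questions

RQ1: Can ZeRO-Offload train models that exceed the memory capacity of a single GPU without sacrificing efficiency?
RQ2: How does CPU offloading combine with ZeRO data parallelism to scale across a large number of GPUs?
RQ3: What throughput and memory savings are achieved in practice when training billion-parameter models on one or many GPUs?
RQ4: Do the optimized CPU Adam and DPU preserve model convergence while improving performance?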

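Key Findings

On a single NVIDIA V100 GPU, ZeRO-Offload trains models with up to 13B parameters and reaches 40 TFLOPS on a 10B-parameter model, whereas PyTorch alone runs out of memory beyond 1.4B parameters (at 30 TFLOPS).
Combined with ZeRO-powered data parallelism, it scales near-linearly on up to 128 GPUs.
On a single DGX-2 node, combining ZeRO-Offload with model parallelism trains models with over 70B parameters, a 4.5x increase over model parallelism alone.
The CPU-Adam optimizations yield more than a 6x speedup over the standard PyTorch Adam implementation, and enabling the delayed parameter update improves end-to-end throughput by up to roughly 1.5x.
The approach trains models about 10x larger than PyTorch alone can handle, with minimal CPU compute overhead and bounded communication, and without any model refactoring.
The offload strategy is unique and optimal with respect to its stated goals: maximizing GPU memory savings while minimizing CPU computation and CPU-GPU communication.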
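
The size figures above can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes the commonly used accounting for mixed-precision Adam, which this summary does not spell out: 2 bytes per parameter for FP16 weights, 2 bytes for FP16 gradients, and 12 bytes for FP32 optimizer state (master weights, momentum, variance), or 16 bytes in total. It ignores activations and temporary buffers, and it assumes that with data parallelism the offloaded gradients and optimizer states are split evenly across ranks. The function names are hypothetical.

```python
def mixed_precision_adam_bytes(num_params):
    """Per-component model-state bytes for mixed-precision Adam (assumed accounting)."""
    return {
        "fp16_params": 2 * num_params,
        "fp16_grads": 2 * num_params,
        # FP32 master weights + momentum + variance, 4 bytes each.
        "fp32_optimizer": 12 * num_params,
    }


def offload_split(num_params, data_parallel_degree=1):
    """Bytes kept on each GPU vs. bytes offloaded to the CPU per rank."""
    b = mixed_precision_adam_bytes(num_params)
    gpu_bytes = b["fp16_params"]  # only FP16 weights stay on the GPU
    cpu_bytes_per_rank = (b["fp16_grads"] + b["fp32_optimizer"]) / data_parallel_degree
    return gpu_bytes, cpu_bytes_per_rank


def cpu_gpu_traffic_bytes(num_params):
    """Per-iteration traffic: FP16 grads to the CPU plus FP16 params back."""
    return 2 * num_params + 2 * num_params


if __name__ == "__main__":
    for n in (1_400_000_000, 13_000_000_000):
        total = sum(mixed_precision_adam_bytes(n).values())
        gpu, cpu = offload_split(n)
        print(f"{n / 1e9:.1f}B params: total {total / 1e9:.1f} GB, "
              f"GPU {gpu / 1e9:.1f} GB, CPU {cpu / 1e9:.1f} GB")
```

Under these assumptions a 13B-parameter model carries 208 GB of model state, of which only 26 GB (the FP16 weights) must live on the GPU while 182 GB moves to host memory. A 1.4B-parameter model needs 22.4 GB of model state without offloading, which is consistent with it being near the limit of what a single GPU can hold once activations are added.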


This review was generated by AI and checked by a human editor.