QUICK REVIEW

[Paper Review] Training Large Neural Networks with Constant Memory using a New Execution Algorithm

Bharadwaj Pudipeddi, Maral Mesmakhosroshahi|arXiv (Cornell University)|Feb 13, 2020

Ferroelectric and Negative Capacitance Devices | References: 15 | Cited by: 24

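ONE-SENTENCE SUMMARY

This paper proposes L2L (layer-to-layer), a new execution algorithm that offloads the full model to a CPU-based eager param-server (EPS) and keeps only the executing layer's parameters and activations in GPU memory, which enables large-model training with constant device memory. Compared with the state-of-the-art baseline, it reduces memory by 45% and increases throughput by 40%, and it fits models of up to 50 billion parameters on a single 16GB V100 GPU with 512GB of CPU memory, without model partitioning or a restricted batch size.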

ABSTRACT

Widely popular transformer-based NLP models such as BERT and Turing-NLG have enormous capacity trending to billions of parameters. Current execution methods demand brute-force resources such as HBM devices and high speed interconnectivity for data parallelism. In this paper, we introduce a new relay-style execution technique called L2L (layer-to-layer) where at any given moment, the device memory is primarily populated only with the executing layer(s)'s footprint. The model resides in the DRAM memory attached to either a CPU or an FPGA as an entity we call eager param-server (EPS). To overcome the bandwidth issues of shuttling parameters to and from EPS, the model is executed a layer at a time across many micro-batches instead of the conventional method of minibatches over whole model. L2L is implemented using 16GB V100 devices for BERT-Large running it with a device batch size of up to 256. Our results show 45% reduction in memory and 40% increase in the throughput compared to the state-of-the-art baseline. L2L is also able to fit models up to 50 Billion parameters on a machine with a single 16GB V100 and 512GB CPU memory and without requiring any model partitioning. L2L scales to arbitrary depth allowing researchers to develop on affordable devices which is a big step toward democratizing AI. By running the optimizer in the host EPS, we show a new form of mixed precision for faster throughput and convergence. In addition, the EPS enables dynamic neural architecture approaches by varying layers across iterations. Finally, we also propose and demonstrate a constant memory variation of L2L and we propose future enhancements. This work has been performed on GPUs first, but also targeted towards all high TFLOPS/Watt accelerators.

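MOTIVATION AND OBJECTIVES

Address the growing memory and compute demands of large transformer models such as BERT and Turing-NLG, which exceed the capacity of standard GPUs.
Train multi-billion-parameter models on affordable hardware without relying on high-bandwidth memory (HBM) devices or model partitioning.
Develop a constant-memory execution method that scales to arbitrary model depth and supports dynamic neural architecture approaches.
Reduce memory pressure and raise throughput by moving model weights and optimizer state to a CPU-based eager param-server (EPS) and executing the model one layer at a time.
Enable mixed-precision training and efficient data-parallel processing through a new low-overhead mechanism for moving parameters between GPU and CPU.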

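PROPOSED METHOD

L2L uses relay-style execution: only the executing layer's parameters and activations are held in GPU memory, while the full model resides in DRAM attached to a CPU or an FPGA, acting as an eager param-server (EPS).
The EPS sends the next layer's parameters ahead of that layer's execution. Running each layer across many micro-batches in an inner loop cuts idle time and reduces how often parameters must be transferred.
The method processes micro-batches layer by layer instead of running whole minibatches through the whole model, which lowers the memory footprint and makes device memory use constant with respect to model depth.
The EPS performs gradient aggregation and weight updates in parallel with GPU computation, which enables a new form of mixed-precision training with faster convergence.
A proposed future extension, L2Lp, performs gradient aggregation and weight updates fully in parallel in the EPS and sends only the next layer's parameters over high-speed NVLink, which significantly reduces the dependence on bandwidth.
Because layers execute independently and can be changed between iterations, the method supports dynamic neural architecture approaches.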
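
The layer-at-a-time schedule described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `EagerParamServer`, the function `l2l_step`, the tanh layers, the squared-error loss, and the plain SGD update are all hypothetical stand-ins, and the host-to-device transfer is simulated with an array copy. Prefetching and compute/transfer overlap are not modeled.

```python
import numpy as np


class EagerParamServer:
    """Toy host-side store (hypothetical): holds all layers, applies updates."""

    def __init__(self, weights, lr=0.1):
        self.weights = [w.copy() for w in weights]
        self.lr = lr

    def fetch(self, layer):
        # Stands in for a host-to-device copy of one layer's parameters.
        return self.weights[layer].copy()

    def apply_gradient(self, layer, grad):
        # Plain SGD on the host; the paper's optimizer is not modeled here.
        self.weights[layer] -= self.lr * grad


def l2l_step(eps, micro_batches, targets):
    """One training step, executed one layer at a time over all micro-batches."""
    n_layers = len(eps.weights)
    total = sum(len(x) for x in micro_batches)

    # acts[l][m] is the input to layer l for micro-batch m.
    acts = [list(micro_batches)]
    for l in range(n_layers):
        w = eps.fetch(l)  # only this layer's weights are "on device"
        acts.append([np.tanh(x @ w) for x in acts[l]])

    # Gradient of 0.5 * sum((y - t)**2) / total with respect to the output.
    deltas = [(y - t) / total for y, t in zip(acts[-1], targets)]

    grads = [None] * n_layers
    for l in reversed(range(n_layers)):
        w = eps.fetch(l)
        grad_w = np.zeros_like(w)
        for m, (x, y) in enumerate(zip(acts[l], acts[l + 1])):
            dz = deltas[m] * (1.0 - y ** 2)  # derivative of tanh
            grad_w += x.T @ dz               # accumulate over micro-batches
            deltas[m] = dz @ w.T             # gradient passed to the layer below
        grads[l] = grad_w
        eps.apply_gradient(l, grad_w)        # host updates this layer right away
    return grads


rng = np.random.default_rng(0)
weights = [0.5 * rng.normal(size=(4, 4)) for _ in range(3)]
x = rng.normal(size=(8, 4))
t = rng.normal(size=(8, 4))

eps = EagerParamServer(weights, lr=0.1)
grads = l2l_step(eps, np.split(x, 4), np.split(t, 4))
```

Each layer's update is applied only after every micro-batch has passed through that layer, so in this toy setting the accumulated gradients match those of a conventional full-batch pass over the same data.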

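EXPERIMENTAL RESULTS

Research Questions

RQ1: Can offloading the model to a CPU-based param-server enable constant-memory training of large transformer models on a standard GPU?
RQ2: How does layer-by-layer execution over micro-batches compare with conventional minibatch training in memory use and throughput?
RQ3: Can L2L train very deep models (for example, 384 layers) on a single 16GB V100 without running out of memory?
RQ4: To what extent does the EPS-based architecture, combining optimized parameter transfer with mixed-precision training, deliver faster convergence and higher throughput?
RQ5: Can L2L support dynamic neural architecture approaches by allowing layers to be changed between iterations without recompilation or reconfiguration?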

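Key Findings

When training BERT-Large on a single 16GB V100 GPU, L2L reduces GPU memory use by 45% relative to the state-of-the-art baseline.
While lowering memory pressure, the method also raises training throughput by 40% over the baseline.
L2L runs BERT-Large on a single 16GB V100 with a device batch size of up to 256, whereas the baseline already struggles at a batch size of 2.
With a single 16GB V100 and 512GB of CPU memory, the method fits models of up to 50 billion parameters without model partitioning or out-of-memory errors.
L2L keeps memory use constant regardless of model depth and can train models of up to 384 layers without running out of memory.
Validation curves show L2L converging faster than the baseline in both FP32 and mixed-precision modes, indicating improved training efficiency.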
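
A rough back-of-the-envelope calculation helps show why depth matters for device memory. The sketch below is an illustration under assumptions that are not taken from the paper: a hidden size of 1024 (as in BERT-Large), FP32 weights, about 12·h² weights per standard transformer encoder layer (biases and LayerNorm ignored), and two layers resident on the device in the layer-at-a-time case (the executing layer plus one being prefetched). Activations, gradients, and optimizer state are not counted, and the helper names are hypothetical.

```python
def layer_params(hidden):
    # Weight matrices of one standard transformer encoder layer:
    # 4*h^2 for attention (Q, K, V, output projection) plus
    # 8*h^2 for a feed-forward block with a 4x-wide inner layer.
    return 12 * hidden * hidden


def weight_bytes(n_layers, hidden, bytes_per_param=4):
    # Bytes needed to hold the weights of n_layers layers (FP32 by default).
    return n_layers * layer_params(hidden) * bytes_per_param


GB = 10 ** 9

# Assumption: all 384 layers kept on the device at once.
resident_all = weight_bytes(384, 1024) / GB

# Assumption: only the executing layer and one prefetched layer on the device.
resident_l2l = weight_bytes(2, 1024) / GB

print(f"all layers resident: {resident_all:.2f} GB")
print(f"two layers resident: {resident_l2l:.2f} GB")
```

Under these assumptions the weights of a 384-layer model alone come to roughly 19 GB, more than a 16GB device holds, while a two-layer working set is about 0.1 GB and does not grow with depth.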

This review was generated by AI and reviewed by a human editor.