QUICK REVIEW

[Paper Review] Reducing Activation Recomputation in Large Transformer Models

Vijay Anand Korthikanti, Jared Casper|arXiv (Cornell University)|May 10, 2022

Parallel Computing and Optimization Techniques | Cited by 52

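ONE-SENTENCE SUMMARY

This paper proposes sequence parallelism and selective activation recomputation, which sharply reduce activation memory and recomputation overhead, making it possible to train very large Transformers with less memory and roughly 30% higher throughput.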

ABSTRACT

Training large transformer models is one of the most important computational challenges of modern AI. In this paper, we show how to significantly accelerate training of large transformer models by reducing activation recomputation. Activation recomputation is commonly used to work around memory capacity constraints. Rather than storing activations for backpropagation, they are traditionally recomputed, which saves memory but adds redundant compute. In this work, we show most of this redundant compute is unnecessary because we can reduce memory consumption sufficiently without it. We present two novel yet very simple techniques: sequence parallelism and selective activation recomputation. In conjunction with tensor parallelism, these techniques almost eliminate the need to recompute activations. We evaluate our approach on language models up to one trillion parameters in scale and show that our method reduces activation memory by 5x, while reducing execution time overhead from activation recomputation by over 90%. For example, when training a 530B parameter GPT-3 style model on 2240 NVIDIA A100 GPUs, we achieve a Model Flops Utilization of 54.2%, which is 29% faster than the 42.1% we achieve using recomputation. Our implementation will be available in both Megatron-LM and NeMo-Megatron.

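MOTIVATION AND OBJECTIVES

Expose the memory bottleneck that activation storage creates when training large-scale Transformers.
Develop methods that reduce activation memory without incurring large compute overhead.
Combine tensor parallelism and sequence parallelism to minimize recomputation.
Introduce selective activation recomputation to further reduce memory and overhead.
Provide an empirical evaluation on models of up to roughly one trillion parameters.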

PROPOSED METHOD

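Derive an estimate of the activation memory of a Transformer layer to quantify memory requirements (Equation 1).
Apply tensor parallelism to partition activations inside the attention and MLP blocks (Equation 2).
Introduce sequence parallelism, which partitions the regions not covered by tensor parallelism along the sequence dimension (leading to Equation 4).
Combine tensor, sequence, and pipeline parallelism, and derive the conversion operators g and ḡ that minimize extra communication (Figures 5 and 6).
Propose selective activation recomputation, which targets the regions with large activations (such as the attention-related operations) to reduce memory at a modest FLOP cost (Equation 6).
Evaluate the memory and runtime impact on models of up to 1T parameters using the Megatron-LM and NeMo-Megatron implementations.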
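
The per-layer memory estimates referenced above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name and arguments are hypothetical, and the coefficients (34, 10, 24, and the attention term 5as/h) follow our reading of the paper's Equations 1, 2, 4, and 6, which assume 16-bit activations and 1-byte dropout masks. Here s is the sequence length, b the micro-batch size, h the hidden size, a the number of attention heads, and t the tensor-parallel size.

```python
def per_layer_activation_bytes(s, b, h, a, t=1,
                               sequence_parallel=False,
                               selective_recompute=False):
    """Approximate activation bytes stored per Transformer layer.

    Hypothetical helper; coefficients follow our reading of the paper.
    """
    sbh = s * b * h
    # Attention-core term (softmax, dropout, attention over values).
    # Selective recomputation discards these activations, so the term vanishes.
    attn = 0.0 if selective_recompute else 5 * a * s / h
    if sequence_parallel:
        # Everything is divided by t: sbh/t * (34 + 5as/h), or 34*sbh/t
        # when selective recomputation is also enabled.
        return sbh * (34 + attn) / t
    # Tensor parallelism alone leaves the layer norms and dropouts
    # (the "10" part) replicated: sbh * (10 + 24/t + 5as/(h*t)).
    return sbh * (10 + (24 + attn) / t)


# Assumed 530B-style configuration (illustrative values, not taken from this summary).
cfg = dict(s=2048, b=1, h=20480, a=128)
tensor_only = per_layer_activation_bytes(**cfg, t=8)
combined = per_layer_activation_bytes(**cfg, t=8,
                                      sequence_parallel=True,
                                      selective_recompute=True)
print(f"{tensor_only / combined:.2f}x")  # prints 4.94x
```

Under these assumed settings the ratio between the tensor-parallel baseline and the combined technique comes out close to the 5x reduction quoted in the abstract.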

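EXPERIMENTAL RESULTS

Research Questions

RQ1: How much activation memory is needed to train a large Transformer model under a standard memory layout?
RQ2: Can combined tensor, sequence, and pipeline parallelism reduce activation memory without adding large compute or communication overhead?
RQ3: Does selective activation recomputation significantly reduce memory while keeping the FLOP overhead low?
RQ4: What end-to-end throughput gain do these techniques deliver for trillion-parameter Transformers?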

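Main Findings

Applying sequence parallelism and selective activation recomputation together reduces activation memory by about 5x.
Selective activation recomputation cuts the recomputation FLOP overhead to below 3% for large models, versus 30-40% for full recomputation.
Compared with full recomputation, end-to-end iteration throughput improves by about 29-32% across the tested configurations.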
For GPT-3 (175B) and MT-NLG (530B), the method significantly reduces the memory of each Transformer layer and enables training without heavy recomputation.
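For the 530B GPT-3-style model on 2240 A100 GPUs, Model FLOPs Utilization reaches 54.2%, which is 29% faster than the 42.1% baseline that uses recomputation.
Memory savings grow with model size, reaching about 5x overall, and make training possible in cases where the baseline activations would exceed device memory.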
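
The 29% figure follows directly from the two utilization numbers. For a fixed model and a fixed set of GPUs, Model FLOPs Utilization is proportional to training throughput, so the ratio of the two values gives the speedup. A quick check, with variable names chosen here for illustration:

```python
# Model FLOPs Utilization values reported in the abstract (percent).
mfu_selective = 54.2        # sequence parallelism + selective recomputation
mfu_full_recompute = 42.1   # baseline with activation recomputation

# Same model and hardware, so the MFU ratio equals the throughput ratio.
speedup = mfu_selective / mfu_full_recompute - 1
print(f"{speedup:.1%}")  # prints 28.7%, which rounds to the reported 29%
```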

This review was generated by AI and reviewed by a human editor.