QUICK REVIEW

[Paper Review] Reducing BERT Pre-Training Time from 3 Days to 76 Minutes

Yang You, Jing Li|arXiv (Cornell University)|Apr 1, 2019

Advanced Neural Network Applications | Cited by 73

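Summary in Brief

This paper proposes LAMB (Layer-wise Adaptive Moments optimizer for Batch training), a large-batch optimization method that cuts BERT pre-training time from 3 days to 76 minutes by supporting batch sizes as large as 32,868 on a TPUv3 Pod. LAMB achieves this with layer-wise adaptive learning rates, is backed by a formal convergence guarantee, and outperforms prior methods on both BERT and ResNet-50.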

ABSTRACT

Training large deep neural networks on massive datasets is very challenging. One promising approach to tackle this issue is through the use of large batch stochastic optimization. However, our understanding of this approach in the context of deep learning is still very limited. Furthermore, the current approaches in this direction are heavily hand-tuned. To this end, we first study a general adaptation strategy to accelerate training of deep neural networks using large minibatches. Using this strategy, we develop a new layer-wise adaptive large batch optimization technique called LAMB. We also provide a formal convergence analysis of LAMB as well as the previous published layerwise optimizer LARS, showing convergence to a stationary point in general nonconvex settings. Our empirical results demonstrate the superior performance of LAMB for BERT and ResNet-50 training. In particular, for BERT training, our optimization technique enables use of very large batches sizes of 32868; thereby, requiring just 8599 iterations to train (as opposed to 1 million iterations in the original paper). By increasing the batch size to the memory limit of a TPUv3 pod, BERT training time can be reduced from 3 days to 76 minutes. Finally, we also demonstrate that LAMB outperforms previous large-batch training algorithms for ResNet-50 on ImageNet; obtaining state-of-the-art performance in just a few minutes.

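Motivation and Objectives

Address the excessive training time of large deep neural networks such as BERT, which demand substantial compute resources.
Improve the efficiency of large mini-batch stochastic optimization in deep learning, an approach that remains poorly understood and heavily hand-tuned in practice.
Develop a general adaptive optimization strategy that trains stably and quickly with very large mini-batches.
Provide a formal convergence analysis of LAMB and the earlier LARS optimizer in nonconvex settings, giving both methods a theoretical footing.
Achieve state-of-the-art performance on BERT and ResNet-50 while substantially cutting training time and iteration count.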

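Proposed Method

Proposes LAMB, a layer-wise adaptive large-batch optimization technique that scales each layer's learning rate by the ratio of the layer's parameter norm to the norm of its update.
Modifies the Adam optimizer with layer-wise adaptive learning rates to improve stability and convergence in large mini-batch training.
Introduces a normalization mechanism that stabilizes training by balancing the magnitude of each layer's update against that of its parameters.
Provides a formal convergence analysis showing that both LAMB and LARS converge to a stationary point in general nonconvex settings.
Uses the memory capacity of a TPUv3 Pod to scale the batch size to 32,868, sharply reducing the number of training iterations.
Uses a learning-rate schedule that keeps the model stable even at very large batch sizes, avoiding the divergence common in standard large mini-batch methods.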
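
The layer-wise scaling described in the list above can be sketched for a single layer as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `lamb_step`, the default hyperparameter values, and the choice to use the parameter norm directly (without any clipping function) are all assumptions made for the example.

```python
import numpy as np

def lamb_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.0):
    """One LAMB-style update for the parameters of a single layer.

    w, g : parameter and gradient arrays of the same shape
    m, v : running first and second moment estimates (same shape as w)
    t    : 1-based step counter, used for bias correction
    All default values are illustrative assumptions.
    """
    # Adam-style moment estimates.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g

    # Bias correction for the zero-initialized moments.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    # Adam direction plus weight decay applied to the update.
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w

    # Layer-wise trust ratio: parameter norm over update norm.
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

    # The effective learning rate for this layer is lr * trust.
    w = w - lr * trust * update
    return w, m, v

# Usage: one step on a toy 3-parameter "layer".
w0 = np.array([1.0, -2.0, 3.0])
g0 = np.array([0.1, 0.2, -0.3])
w1, m1, v1 = lamb_step(w0, g0, np.zeros(3), np.zeros(3), t=1, lr=0.01)
```

In this sketch the step taken by a layer has norm `lr` times the layer's parameter norm whenever both norms are nonzero, so the step size does not depend on the raw gradient scale. In a full model the function would be applied to each layer separately, which is what makes the learning rate layer-wise.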

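Experimental Results

Research Questions

RQ1: Is there a general adaptive strategy for large-batch training that significantly shortens BERT pre-training without degrading model quality?
RQ2: How do layer-wise adaptive learning rates improve optimization stability and convergence for deep networks in the large mini-batch setting?
RQ3: Does LAMB achieve theoretical convergence guarantees in nonconvex optimization comparable to those of earlier methods such as LARS?
RQ4: Can LAMB train BERT at batch sizes close to the memory limit of modern accelerators such as a TPUv3 Pod?
RQ5: How does LAMB compare with existing large mini-batch optimization methods in accuracy and training speed on standard benchmarks such as ImageNet and GLUE?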

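Key Findings

LAMB cuts BERT pre-training time from 3 days to 76 minutes by enabling a batch size of 32,868 on a TPUv3 Pod.
With LAMB, BERT training takes only 8,599 iterations, far fewer than the 1 million iterations in the original BERT paper.
LAMB reaches state-of-the-art performance on ResNet-50/ImageNet in just a few minutes, outperforming earlier large-batch methods.
The method trains stably at very large batch sizes, avoiding the divergence common in standard large mini-batch optimization.
The formal convergence analysis shows that LAMB converges to a stationary point in general nonconvex settings, giving the method a theoretical basis.
LAMB outperforms earlier large-batch training algorithms on both BERT and ResNet-50, with consistent gains in speed and accuracy.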
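
To put the headline numbers above in proportion, the short calculation below derives the implied ratios. It uses only the figures quoted in this review; "3 days" is taken as exactly 72 hours, which is an assumption, so the time ratio is approximate.

```python
# Figures quoted in the review; "3 days" is treated as exactly 72 hours.
baseline_minutes = 3 * 24 * 60   # 4320 minutes
lamb_minutes = 76
baseline_iters = 1_000_000       # iterations in the original BERT setup
lamb_iters = 8_599               # iterations reported with LAMB

time_speedup = baseline_minutes / lamb_minutes
iter_reduction = baseline_iters / lamb_iters

print(f"wall-clock speedup: {time_speedup:.1f}x")    # 56.8x
print(f"iteration reduction: {iter_reduction:.1f}x") # 116.3x
```

The iteration count falls by a larger factor than the wall-clock time, which suggests that each large-batch iteration takes longer than a baseline iteration, as expected when every step processes far more data.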

This review was generated by AI and checked by a human editor.