QUICK REVIEW

[Paper Review] ImageNet Training in Minutes

Yang You, Zhao Zhang|arXiv (Cornell University)|Sep 14, 2017

Advanced Neural Network Applications | References: 24 | Cited by: 45

One-Sentence Summary

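The paper shows that, with the Layer-wise Adaptive Rate Scaling (LARS) algorithm, 90-epoch ResNet-50 training on ImageNet-1k finishes in 20 minutes on 2048 KNL processors without losing accuracy. A shorter 64-epoch run reaches 74.9% top-1 test accuracy in 14 minutes, against a prior state of the art of 15 minutes for the same accuracy. LARS keeps accuracy stable at batch sizes up to 32K, which lets large-batch synchronous SGD scale across many processors and cuts training time without sacrificing model quality.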

ABSTRACT

Finishing 90-epoch ImageNet-1k training with ResNet-50 on a NVIDIA M40 GPU takes 14 days. This training requires 10^18 single precision operations in total. On the other hand, the world's current fastest supercomputer can finish 2 * 10^17 single precision operations per second (Dongarra et al 2017, https://www.top500.org/lists/2017/06/). If we can make full use of the supercomputer for DNN training, we should be able to finish the 90-epoch ResNet-50 training in one minute. However, the current bottleneck for fast DNN training is in the algorithm level. Specifically, the current batch size (e.g. 512) is too small to make efficient use of many processors. For large-scale DNN training, we focus on using large-batch data-parallelism synchronous SGD without losing accuracy in the fixed epochs. The LARS algorithm (You, Gitman, Ginsburg, 2017, arXiv:1708.03888) enables us to scale the batch size to extremely large case (e.g. 32K). We finish the 100-epoch ImageNet training with AlexNet in 11 minutes on 1024 CPUs. About three times faster than Facebook's result (Goyal et al 2017, arXiv:1706.02677), we finish the 90-epoch ImageNet training with ResNet-50 in 20 minutes on 2048 KNLs without losing accuracy. State-of-the-art ImageNet training speed with ResNet-50 is 74.9% top-1 test accuracy in 15 minutes. We got 74.9% top-1 test accuracy in 64 epochs, which only needs 14 minutes. Furthermore, when we increase the batch size to above 16K, our accuracy is much higher than Facebook's on corresponding batch sizes. Our source code is available upon request.
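The abstract's "one minute" estimate follows from simple arithmetic on the figures it quotes. The sketch below reproduces that arithmetic; the constant and function names are illustrative, and the wall-clock comparison says nothing about per-device efficiency because the M40 GPU and the KNL cluster are different hardware.

```python
# Figures quoted in the abstract.
TOTAL_OPS = 1e18           # single-precision operations for 90-epoch ResNet-50 training
PEAK_OPS_PER_SEC = 2e17    # peak rate quoted for the fastest supercomputer (TOP500, June 2017)
M40_DAYS = 14              # reported training time on one NVIDIA M40 GPU
KNL_MINUTES = 20           # reported training time on 2048 KNLs


def seconds_at_rate(total_ops, ops_per_sec):
    """Time to finish total_ops at a perfectly sustained rate."""
    return total_ops / ops_per_sec


def fraction_of_peak_needed(total_ops, target_seconds, peak_ops_per_sec):
    """Share of peak throughput that must be sustained to hit target_seconds."""
    return (total_ops / target_seconds) / peak_ops_per_sec


def wall_clock_speedup(baseline_days, new_minutes):
    """Ratio of wall-clock times; the two systems differ, so this is not efficiency."""
    return baseline_days * 24 * 60 / new_minutes


ideal_seconds = seconds_at_rate(TOTAL_OPS, PEAK_OPS_PER_SEC)
one_minute_fraction = fraction_of_peak_needed(TOTAL_OPS, 60, PEAK_OPS_PER_SEC)
speedup = wall_clock_speedup(M40_DAYS, KNL_MINUTES)

print(f"ideal time at peak: {ideal_seconds:.1f} s")
print(f"peak fraction needed for a one-minute run: {one_minute_fraction:.1%}")
print(f"wall-clock speedup, 14 days vs 20 minutes: {speedup:.0f}x")
```

At full peak the run would take only a few seconds, so the one-minute target already leaves a wide margin for imperfect utilization.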

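Motivation and Objectives

Investigate whether the LARS algorithm lets DNN training scale efficiently to more processors.
Cut ImageNet-1k training time from 14 days on a single GPU to minutes, using large-batch synchronous SGD.
Maintain high test accuracy (74.9% top-1) at very large batch sizes (up to 32K) without data augmentation.
Evaluate how data augmentation affects accuracy at large batch sizes.
Show that large-batch training with LARS achieves a state-of-the-art speed-accuracy trade-off on ResNet-50 and AlexNet.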

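Proposed Method

Apply the Layer-wise Adaptive Rate Scaling (LARS) algorithm to keep training stable at very large batch sizes (up to 32K).
Scale synchronous stochastic gradient descent (SGD) with progressively larger batch sizes to 1024 CPUs (AlexNet) and 2048 KNLs (ResNet-50).
Preserve accuracy by setting each layer's learning rate from the ratio of its weight norm to its gradient norm.
Improve the balance between communication and computation by enlarging the batch size, which reduces the number of iterations and communication rounds for a fixed number of epochs.
Use a learning-rate warmup schedule to stabilize training at large batch sizes.
Run an ablation study comparing accuracy with and without data augmentation at large batch sizes.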
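The layer-wise scaling and warmup steps listed above can be sketched in a few lines. This is a minimal pure-Python illustration of the update rule as published in the cited LARS report (arXiv:1708.03888), not the authors' code. The function names, the hyperparameter defaults, the linear shape of the warmup, and the fallback to a local rate of 1.0 when a norm is zero are all assumptions made for the example.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def warmup_lr(step, warmup_steps, peak_lr):
    """Linear warmup (assumed shape): ramp up to peak_lr, then hold it."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


def lars_step(weights, grads, velocity, global_lr,
              trust_coeff=0.001, weight_decay=0.0005, momentum=0.9):
    """One LARS update for a single layer; returns (new_weights, new_velocity).

    The layer's local rate is trust_coeff * ||w|| / (||g|| + weight_decay * ||w||),
    so layers whose gradients are large relative to their weights take smaller steps.
    """
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    denom = g_norm + weight_decay * w_norm
    # Assumption: use a neutral local rate when a norm is zero to avoid dividing by zero.
    local_lr = trust_coeff * w_norm / denom if w_norm > 0 and denom > 0 else 1.0
    new_velocity = [
        momentum * v + global_lr * local_lr * (g + weight_decay * w)
        for v, g, w in zip(velocity, grads, weights)
    ]
    new_weights = [w - v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def relative_step(old, new):
    """Size of the update relative to the size of the weights."""
    return l2_norm([o - n for o, n in zip(old, new)]) / l2_norm(old)


# Two layers with identical weights but gradients five orders of magnitude apart.
w = [3.0, 4.0]
zero_v = [0.0, 0.0]
small, _ = lars_step(w, [0.003, 0.004], zero_v, global_lr=1.0,
                     trust_coeff=0.01, weight_decay=0.0, momentum=0.0)
large, _ = lars_step(w, [300.0, 400.0], zero_v, global_lr=1.0,
                     trust_coeff=0.01, weight_decay=0.0, momentum=0.0)
print(relative_step(w, small), relative_step(w, large))
```

With weight decay and momentum switched off, both layers move by the same fraction of their weight norm regardless of gradient scale, which is the property that keeps very large batches, and the large learning rates they call for, from destabilizing individual layers.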

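Experimental Results

Research Questions

RQ1: Can LARS train stably and accurately on ImageNet-1k at batch sizes up to 32K?
RQ2: What is the largest number of processors that DNN training with LARS can use efficiently with minimal loss of accuracy?
RQ3: How does training time with LARS change as batch size and processor count grow?
RQ4: How does the absence of data augmentation at large batch sizes affect accuracy compared with standard training?
RQ5: Can large-batch training with LARS reach state-of-the-art accuracy in significantly fewer minutes than earlier methods?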

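Key Findings

With 1024 CPUs and a batch size of 32K, AlexNet completes 100 epochs of ImageNet training in 11 minutes and reaches 58.6% top-1 accuracy.
With 2048 KNL processors and a batch size of 32K, ResNet-50 completes 90 epochs of ImageNet training in 20 minutes without losing accuracy.
The authors reach 74.9% top-1 test accuracy in only 64 epochs (14 minutes), beating the previous state of the art, which needed 15 minutes for the same accuracy.
At batch sizes above 16K, the method's accuracy is much higher than Facebook's at the corresponding batch sizes, most visibly without data augmentation.
For a fixed number of epochs, communication volume falls in inverse proportion to batch size; compared with small-batch training at the same floating-point operation count, data movement drops by up to 90%.
LARS-based training holds its accuracy as it scales, giving a better speed-accuracy trade-off at large batch sizes than earlier methods.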
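The link between batch size and communication in the findings above comes down to iteration counts: a fixed number of epochs processes a fixed number of samples, so a larger batch means fewer synchronous steps. The sketch below counts those steps under two stated assumptions: each step performs exactly one gradient exchange whose size equals the model size, and the model occupies roughly 100 MB (an illustrative figure, not taken from the paper). The training-set size is the standard ImageNet-1k figure.

```python
import math

IMAGENET_1K_TRAIN_IMAGES = 1_281_167   # images in the ImageNet-1k training set
ASSUMED_MODEL_BYTES = 100 * 1024 ** 2  # hypothetical gradient size per exchange


def iterations(epochs, batch_size, dataset_size=IMAGENET_1K_TRAIN_IMAGES):
    """Synchronous SGD steps needed to process `epochs` passes over the data."""
    return math.ceil(epochs * dataset_size / batch_size)


def gradient_bytes_exchanged(epochs, batch_size, model_bytes=ASSUMED_MODEL_BYTES):
    """Total gradient traffic, assuming one model-sized exchange per step."""
    return iterations(epochs, batch_size) * model_bytes


for batch_size in (512, 4096, 32768):
    steps = iterations(90, batch_size)
    gib = gradient_bytes_exchanged(90, batch_size) / 1024 ** 3
    print(f"batch {batch_size:>6}: {steps:>7} steps, about {gib:,.0f} GiB of gradients")
```

Under these assumptions the number of communication rounds scales as one over the batch size, while the arithmetic per epoch stays the same. Actual traffic also depends on the all-reduce algorithm and the number of workers, which this count leaves out.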

This review was generated by AI and checked by a human editor.