QUICK REVIEW

[Paper Review] ImageNet Training by CPU: AlexNet in 11 Minutes and ResNet-50 in 48 Minutes

Yang You, Zhao Zhang|arXiv (Cornell University)|Sep 14, 2017

Advanced Neural Network Applications|References: 2|Cited by: 2

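One-Sentence Summary

The paper shows that large-batch, data-parallel synchronous SGD with the LARS optimizer finishes 100-epoch ImageNet training with AlexNet in 11 minutes on 1024 CPUs, and 90-epoch training with ResNet-50 in 20 minutes on 2048 KNLs without losing accuracy; ResNet-50 reaches 74.9% top-1 test accuracy in 64 epochs (14 minutes). The speedup comes from removing the algorithmic bottleneck that limits batch-size scaling.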

ABSTRACT

Finishing 90-epoch ImageNet-1k training with ResNet-50 on a NVIDIA M40 GPU takes 14 days. This training requires 10^18 single precision operations in total. On the other hand, the world's current fastest supercomputer can finish 2 * 10^17 single precision operations per second (Dongarra et al 2017). If we can make full use of the supercomputer for DNN training, we should be able to finish the 90-epoch ResNet-50 training in one minute. However, the current bottleneck for fast DNN training is in the algorithm level. Specifically, the current batch size (e.g. 512) is too small to make efficient use of many processors. For large-scale DNN training, we focus on using large-batch data-parallelism synchronous SGD without losing accuracy in the fixed epochs. The LARS algorithm (You, Gitman, Ginsburg, 2017, arXiv:1708.03888) enables us to scale the batch size to extremely large case (e.g. 32K). We finish the 100-epoch ImageNet training with AlexNet in 11 minutes on 1024 CPUs. About three times faster than Facebook's result (Goyal et al 2017, arXiv:1706.02677), we finish the 90-epoch ImageNet training with ResNet-50 in 20 minutes on 2048 KNLs without losing accuracy. State-of-the-art ImageNet training speed with ResNet-50 is 74.9% top-1 test accuracy in 15 minutes. We got 74.9% top-1 test accuracy in 64 epochs, which only needs 14 minutes. Furthermore, when we increase the batch size to above 16K, our accuracy is much higher than Facebook's on corresponding batch sizes. Our source code is available upon request.
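
The abstract's argument rests on two pieces of arithmetic: how long 10^18 operations would take at the quoted peak rate, and how few sequential updates remain once the batch size grows from 512 to 32K. The sketch below works both out. The operation count and peak rate come from the abstract; the training-set size of 1,281,167 images and the helper names are assumptions added for illustration.

```python
# Figures quoted in the abstract.
TOTAL_OPS = 1e18          # single-precision operations, 90-epoch ResNet-50
PEAK_OPS_PER_SEC = 2e17   # fastest supercomputer cited in the abstract

# Assumption (not stated in the abstract): ImageNet-1k training-set size.
TRAIN_IMAGES = 1_281_167
EPOCHS = 90


def ideal_seconds(total_ops: float, ops_per_sec: float) -> float:
    """Lower bound on run time if the machine ran at 100% utilization."""
    return total_ops / ops_per_sec


def iterations(batch_size: int, epochs: int = EPOCHS,
               images: int = TRAIN_IMAGES) -> int:
    """Sequential SGD updates, counting a partial last batch in each epoch."""
    per_epoch = -(-images // batch_size)  # ceiling division
    return per_epoch * epochs


if __name__ == "__main__":
    print(f"ideal time: {ideal_seconds(TOTAL_OPS, PEAK_OPS_PER_SEC):.0f} s")
    for batch in (512, 32 * 1024):
        print(f"batch {batch:>6}: {iterations(batch):>7} sequential updates")
```

Under these assumptions the perfect-utilization bound is 5 seconds, well inside the one-minute estimate, and moving from batch 512 to batch 32K cuts the sequential updates from 225,270 to 3,600. Each update is a synchronization point, so fewer and larger updates are what let thousands of processors stay busy.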

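Motivation and Objectives

Address the algorithmic bottleneck caused by small batch sizes (e.g. 512), which prevents large-scale DNN training from making efficient use of many processors.
Train with extremely large batches, up to 32K, without sacrificing model accuracy on ImageNet.
Reach state-of-the-art training speed for ResNet-50 and AlexNet on CPU-only systems while keeping test accuracy high.
Show that supercomputer-class compute can be used effectively for DNN training once the algorithmic inefficiency is resolved.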

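Proposed Method

Use data parallelism with synchronous stochastic gradient descent (SGD) on CPU clusters to scale training to thousands of processors.
Use the LARS optimizer (Layer-wise Adaptive Rate Scaling) to stabilize training at extremely large batch sizes (e.g. 32K), so that training converges without an accuracy drop.
Train very large batches for a fixed number of epochs (100 for AlexNet, 90 for ResNet-50), cutting training time sharply while preserving accuracy.
Run on CPU architectures (1024 CPUs for AlexNet, 2048 KNLs for ResNet-50) to maximize throughput and minimize wall-clock time.
Tune hyperparameters and the learning-rate schedule to preserve generalization at large batch sizes.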
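
The two core ingredients listed above, gradient averaging across workers and a layer-wise scaled update, can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the update follows the commonly cited form of LARS (a per-layer learning rate proportional to the ratio of weight norm to gradient norm), and the function names, the default values of `eta`, `weight_decay`, and `momentum`, and the in-process stand-in for an all-reduce are all assumptions.

```python
import numpy as np


def allreduce_mean(worker_grads):
    """Average per-layer gradients across workers (stand-in for an all-reduce)."""
    n_layers = len(worker_grads[0])
    return [np.mean([g[i] for g in worker_grads], axis=0)
            for i in range(n_layers)]


def lars_step(weights, grads, velocities, lr,
              eta=0.001, weight_decay=0.0005, momentum=0.9):
    """One LARS-style update over lists of per-layer float arrays, in place."""
    for w, g, v in zip(weights, grads, velocities):
        w_norm = np.linalg.norm(w)
        g_norm = np.linalg.norm(g)
        denom = g_norm + weight_decay * w_norm
        # Layer-wise scaling factor; fall back to 1.0 when a norm is zero.
        local_lr = eta * w_norm / denom if w_norm > 0 and denom > 0 else 1.0
        v *= momentum
        v += lr * local_lr * (g + weight_decay * w)
        w -= v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = [rng.normal(size=(4, 3)), rng.normal(size=3)]
    velocities = [np.zeros_like(w) for w in weights]
    # Four hypothetical workers, each with gradients from its own data shard.
    worker_grads = [[rng.normal(size=w.shape) for w in weights]
                    for _ in range(4)]
    grads = allreduce_mean(worker_grads)
    lars_step(weights, grads, velocities, lr=1.0)
    print([float(np.linalg.norm(v)) for v in velocities])
```

Because the step for each layer is scaled by its own weight-to-gradient norm ratio, a layer with unusually large gradients does not receive a disproportionately large update, which is the property usually credited with keeping very large batches stable.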

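Experimental Results

Research Questions

RQ1 Can large-batch data-parallel SGD with LARS on CPUs finish training within 20 minutes while reaching high ImageNet accuracy?
RQ2 When the batch size is scaled to 16K or beyond, how do the method's training speed and accuracy on CPU systems compare with prior work?
RQ3 Once the algorithmic bottleneck is removed, how closely can CPU-based systems match GPU-based training speed?
RQ4 With large-batch training on CPUs, can ResNet-50 hold 74.9% top-1 accuracy on ImageNet in fewer than 65 epochs?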

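Key Findings

AlexNet: 100-epoch ImageNet training finishes in 11 minutes on 1024 CPUs.
ResNet-50: 90-epoch ImageNet training finishes in 20 minutes on 2048 KNLs without losing accuracy, about three times faster than Facebook's result (Goyal et al. 2017).
ResNet-50 reaches 74.9% top-1 test accuracy in only 64 epochs (14 minutes), compared with the prior state of the art of 74.9% in 15 minutes.
For batch sizes above 16K, accuracy is much higher than Facebook's at the corresponding batch sizes.
The roughly 10^18 single-precision operations of 90-epoch ResNet-50 training complete in 20 minutes on a CPU-only cluster, which shows that supercomputer-scale hardware can be used effectively once the batch-size bottleneck is removed.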
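
To put the reported times in proportion, the short calculation below compares them with the single-GPU baseline from the abstract (14 days on one M40) and derives the time per epoch. All inputs are the abstract's figures; the helper names are illustrative, and the 64-epoch speedup mixes epoch budgets, so it is a wall-clock comparison only.

```python
MINUTES_PER_DAY = 24 * 60

# Baseline from the abstract: 90 epochs on a single M40 GPU in 14 days.
BASELINE_MINUTES = 14 * MINUTES_PER_DAY

# (epochs, minutes) for the ResNet-50 runs reported in the abstract.
RUNS = {
    "90 epochs on 2048 KNLs": (90, 20),
    "64 epochs to 74.9% top-1": (64, 14),
}


def speedup(baseline_minutes: float, minutes: float) -> float:
    """Wall-clock speedup relative to the baseline run."""
    return baseline_minutes / minutes


def seconds_per_epoch(epochs: int, minutes: float) -> float:
    """Average wall-clock seconds spent on one pass over the data."""
    return minutes * 60 / epochs


if __name__ == "__main__":
    for name, (epochs, minutes) in RUNS.items():
        print(f"{name}: {speedup(BASELINE_MINUTES, minutes):.0f}x faster, "
              f"{seconds_per_epoch(epochs, minutes):.1f} s per epoch")
```

Both runs come out at roughly 13 seconds per epoch, so the 14-minute result is consistent with the 20-minute one: the saving comes from needing fewer epochs, not from a faster epoch.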


This review was generated by AI and checked by a human editor.