QUICK REVIEW

[Paper Review] Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

Xianyan Jia, Shutao Song | arXiv (Cornell University) | Jul 30, 2018

Advanced Neural Network Applications | References: 24 | Cited by: 313

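One-Sentence Summary

The paper presents a scalable training system that combines mixed-precision training with LARS to make data-parallel training with 64K mini-batches practical and, with optimized all-reduce, trains ImageNet models (AlexNet and ResNet-50) in minutes rather than hours, outperforming prior systems.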

ABSTRACT

Synchronized stochastic gradient descent (SGD) optimizers with data parallelism are widely used in training large-scale deep neural networks. Although using larger mini-batch sizes can improve the system scalability by reducing the communication-to-computation ratio, it may hurt the generalization ability of the models. To this end, we build a highly scalable deep learning training system for dense GPU clusters with three main contributions: (1) We propose a mixed-precision training method that significantly improves the training throughput of a single GPU without losing accuracy. (2) We propose an optimization approach for extremely large mini-batch size (up to 64k) that can train CNN models on the ImageNet dataset without losing accuracy. (3) We propose highly optimized all-reduce algorithms that achieve up to 3x and 11x speedup on AlexNet and ResNet-50 respectively than NCCL-based training on a cluster with 1024 Tesla P40 GPUs. On training ResNet-50 with 90 epochs, the state-of-the-art GPU-based system with 1024 Tesla P100 GPUs spent 15 minutes and achieved 74.9% top-1 test accuracy, and another KNL-based system with 2048 Intel KNLs spent 20 minutes and achieved 75.4% accuracy. Our training system can achieve 75.8% top-1 test accuracy in only 6.6 minutes using 2048 Tesla P40 GPUs. When training AlexNet with 95 epochs, our system can achieve 58.7% top-1 test accuracy within 4 minutes, which also outperforms all other existing systems.

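Motivation and Objectives

Address the generalization risk of large mini-batch training while achieving high throughput.
Develop mixed-precision training with LARS to preserve accuracy at very large mini-batch sizes.
Design optimized all-reduce algorithms for scalable communication across thousands of GPUs.
Demonstrate state-of-the-art training speed for AlexNet and ResNet-50 on ImageNet.
Evaluate convergence and scalability on large GPU clusters running on real hardware.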

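Proposed Method

Introduce mixed-precision training with LARS to enable very large mini-batches without losing accuracy.
Use FP16 in the forward and backward passes, and keep FP32 master weights for stable updates.
Remove weight decay from the bias and batch normalization (BN) parameters, and add extra BN layers to AlexNet to improve convergence.
Develop tensor fusion and a hybrid all-reduce strategy that combines hierarchical and ring-based approaches for scalable gradient aggregation.
In the 1024-GPU and 2048-GPU setups, use RoCEv2 and GPUDirect RDMA to reduce communication latency and increase bandwidth.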
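
The interplay of FP16 gradients, FP32 master weights, and the LARS layer-wise learning rate listed above can be sketched for a single layer as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name, the loss-scaling factor, and all hyperparameter defaults are assumptions chosen for the example, and momentum is omitted for brevity.

```python
import numpy as np

def lars_mixed_precision_step(master_w, grad_fp16, lr=0.1, trust_coef=0.001,
                              weight_decay=5e-5, loss_scale=1024.0, eps=1e-9):
    """One illustrative update for a single layer.

    master_w  : FP32 master copy of the layer weights
    grad_fp16 : FP16 gradient, assumed to have been scaled by `loss_scale`
    Returns (new FP32 master weights, FP16 copy for the next forward pass).
    Pass weight_decay=0.0 for bias and BN parameters.
    """
    # Convert the gradient to FP32 and undo the (assumed) loss scaling.
    grad = grad_fp16.astype(np.float32) / loss_scale

    # LARS trust ratio: norms are computed in FP32 to avoid FP16 range issues.
    w_norm = np.linalg.norm(master_w)
    g_norm = np.linalg.norm(grad)
    if w_norm > 0 and g_norm > 0:
        local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)
    else:
        local_lr = 1.0  # fall back to plain SGD for degenerate layers

    # Apply the update to the FP32 master weights.
    new_master = master_w - lr * local_lr * (grad + weight_decay * master_w)
    new_master = new_master.astype(np.float32)

    # The FP16 copy is what the forward and backward passes would consume.
    return new_master, new_master.astype(np.float16)

# Example: a 4-element layer with a constant gradient of 0.5 (pre-scaled by 1024).
w = np.ones(4, dtype=np.float32)
g = (np.full(4, 0.5) * 1024.0).astype(np.float16)
w_new, w_half = lars_mixed_precision_step(w, g)
```

Keeping the update in FP32 means that very small steps, which would round to zero in FP16, still accumulate in the master weights.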

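Experimental Results

Research Questions

RQ1: Can mixed-precision training with LARS preserve ImageNet accuracy at mini-batch sizes up to 64K?
RQ2: What architectural and optimization adjustments are needed to maintain convergence at extremely large mini-batch sizes?
RQ3: How can the all-reduce strategy be optimized for high scalability on large GPU clusters?
RQ4: How do the communication optimizations affect overall training time for AlexNet and ResNet-50?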

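Key Findings

Mixed-precision training with LARS preserves ResNet-50 top-1 accuracy at a 64K mini-batch over 90 epochs (76.2% with LARS).
With targeted architectural adjustments, AlexNet reaches 58.8% top-1 accuracy after 95 epochs at a 64K mini-batch.
On 1024 and 2048 Tesla P40 GPUs respectively, the system trains AlexNet (95 epochs) in 4 minutes and ResNet-50 (90 epochs) in 6.6 minutes.
Compared with NCCL-based training on a 1024-GPU cluster, the approach delivers up to 3x speedup on AlexNet and up to 11x on ResNet-50.
Overall, ResNet-50 reaches 75.8% top-1 accuracy in 6.6 minutes on 2048 GPUs, which is both faster and more accurate than the prior systems cited in the abstract (74.9% in 15 minutes on 1024 Tesla P100 GPUs, and 75.4% in 20 minutes on 2048 Intel KNLs).
On 1024 GPUs, scaling efficiency improves from 9.0% to 99.2% when the optimized all-reduce and tensor fusion are used.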
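
The idea behind tensor fusion, packing many small gradient tensors into larger buffers so that each all-reduce call moves enough data to use the network efficiently, can be sketched as a simple bucketing pass. This is a hypothetical illustration: the function name `fuse_tensors`, the greedy grouping rule, and the threshold value are assumptions for the example and are not taken from the paper.

```python
from typing import List

def fuse_tensors(sizes_bytes: List[int], threshold_bytes: int) -> List[List[int]]:
    """Greedily group tensor indices into buckets.

    Tensors are appended to the current bucket in order; once the bucket
    holds at least `threshold_bytes`, it is closed. Each bucket would then
    be sent with a single all-reduce call instead of one call per tensor.
    """
    buckets: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for idx, size in enumerate(sizes_bytes):
        current.append(idx)
        current_bytes += size
        if current_bytes >= threshold_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
    if current:  # flush the last, possibly under-filled bucket
        buckets.append(current)
    return buckets

# Example: six gradient tensors (sizes in bytes) with a 16-byte threshold.
example_buckets = fuse_tensors([4, 4, 100, 8, 8, 8], threshold_bytes=16)
print(example_buckets)  # [[0, 1, 2], [3, 4], [5]]
```

Here six per-tensor all-reduce calls collapse into three; in a real system the threshold would be tuned against network latency and bandwidth.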

This review was generated by AI and reviewed by human editors.