QUICK REVIEW

[Paper Review] Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

Peng Sun, Wansen Feng | arXiv (Cornell University) | Feb 19, 2019

Advanced Neural Network Applications | References: 45 | Citations: 50

One-Line Summary

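This paper proposes GradientFlow, a communication backend, together with a set of network optimizations (lazy allreduce and coarse-grained sparse communication) that accelerate distributed DNN training on GPU clusters. It reports very large speedups on ImageNet/AlexNet and ImageNet/ResNet-50.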

ABSTRACT

It is important to scale out deep neural network (DNN) training for reducing model training time. The high communication overhead is one of the major performance bottlenecks for distributed DNN training across multiple GPUs. Our investigations have shown that popular open-source DNN systems could only achieve 2.5 speedup ratio on 64 GPUs connected by 56 Gbps network. To address this problem, we propose a communication backend named GradientFlow for distributed DNN training, and employ a set of network optimization techniques. First, we integrate ring-based allreduce, mixed-precision training, and computation/communication overlap into GradientFlow. Second, we propose lazy allreduce to improve network throughput by fusing multiple communication operations into a single one, and design coarse-grained sparse communication to reduce network traffic by only transmitting important gradient chunks. When training ImageNet/AlexNet on 512 GPUs, our approach achieves 410.2 speedup ratio and completes 95-epoch training in 1.5 minutes, which outperforms existing approaches.

Motivation and Objectives

Motivate reducing training time for large-scale DNNs by mitigating communication bottlenecks in distributed training.
Assess limitations of existing open-source DNN systems in scaling to hundreds of GPUs over 56 Gbps networks.
Develop a communication backend with enhancements to improve throughput and reduce network traffic.
Demonstrate effectiveness on ImageNet with AlexNet and ResNet-50 to quantify speedups.
Provide a comparison baseline against existing approaches and highlight remaining gaps in utilization.

Proposed Method

Implement GradientFlow as a communication backend for the System-I distributed DNN system.
Integrate ring-based allreduce, mixed-precision training, and computation/communication overlap.
Introduce lazy allreduce to fuse multiple gradient transmissions into fewer, larger operations.
Design coarse-grained sparse communication to transmit only important gradient chunks while maintaining model quality.
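
The two techniques named above can be illustrated with a minimal, single-process sketch. It is not the paper's implementation: `lazy_allreduce`, `select_important_chunks`, `threshold`, `chunk_size`, and `keep` are hypothetical names, the `allreduce` argument is a stand-in for a real collective call, and ranking chunks by L1 norm is an assumption, since this review does not say how chunk importance is measured.

```python
from typing import Callable, List, Sequence, Tuple

Tensor = List[float]


def lazy_allreduce(
    grads: Sequence[Tensor],
    threshold: int,
    allreduce: Callable[[Tensor], Tensor],
) -> Tuple[List[Tensor], int]:
    """Pool gradients as they arrive and issue one fused allreduce
    each time the pooled element count reaches `threshold`."""
    reduced: List[Tensor] = []
    pending: List[Tensor] = []  # tensors waiting in the pool
    pending_size = 0
    calls = 0                   # number of collective operations issued

    def flush() -> None:
        nonlocal pending, pending_size, calls
        if not pending:
            return
        # Fuse all pooled tensors into one flat buffer.
        flat = [x for g in pending for x in g]
        out = allreduce(flat)
        calls += 1
        # Split the reduced buffer back into the original tensor shapes.
        offset = 0
        for g in pending:
            reduced.append(out[offset:offset + len(g)])
            offset += len(g)
        pending, pending_size = [], 0

    for g in grads:
        pending.append(g)
        pending_size += len(g)
        if pending_size >= threshold:
            flush()
    flush()  # send whatever is left at the end of the step
    return reduced, calls


def select_important_chunks(flat: Tensor, chunk_size: int, keep: int) -> List[int]:
    """Return the indices of the `keep` chunks with the largest L1 norm
    (assumed importance measure). Only these chunks would be transmitted."""
    chunks = [flat[i:i + chunk_size] for i in range(0, len(flat), chunk_size)]
    scores = [sum(abs(x) for x in c) for c in chunks]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


if __name__ == "__main__":
    # Stand-in collective: two workers holding identical gradients.
    fake_allreduce = lambda flat: [2 * x for x in flat]
    grads = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
    print(lazy_allreduce(grads, threshold=3, allreduce=fake_allreduce))
    print(select_important_chunks([0.1, -0.1, 5.0, -4.0, 0.0, 1.0], 2, 1))
```

In this toy run, three gradient tensors are sent with two collective calls instead of three. Selecting whole chunks rather than individual gradient values keeps each transmitted region contiguous. A real system would also need every worker to agree on the selected chunk indices and to handle the gradients that are not sent, neither of which is modeled here.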

Experimental Results

Research Questions

RQ1: Can ring-based allreduce with mixed precision and overlap achieve near-linear scaling on large GPU clusters?
RQ2: To what extent do lazy allreduce and coarse-grained sparse communication reduce network traffic and improve throughput for AlexNet and ResNet-50 on ImageNet?
RQ3: How do the proposed techniques compare to existing backends (e.g., Gloo, NCCL, MPI) in throughput and utilization on 56 Gbps networks?
RQ4: What is the impact of these optimizations on training time and speedup for large-scale ImageNet experiments?

Key Findings

On 512 GPUs, the proposed approach achieves a 410.2x speedup for AlexNet and a 434.1x speedup for ResNet-50.
ImageNet/AlexNet training completes 95 epochs in 1.5 minutes on 512 GPUs.
ImageNet/ResNet-50 training completes 90 epochs in 7.3 minutes on 512 GPUs.
Compared to Jia et al. (AlexNet in 4 minutes with 1024 GPUs), the approach is 2.6x faster.
Compared to Akiba et al. (ResNet-50 in 15 minutes with 1024 GPUs), the approach is 2.1x faster.
Even with these optimizations, GPU resource utilization remains far from ideal, e.g., 18.5% and 26.2% on Cluster-V for AlexNet and ResNet-50, respectively.
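
As a quick sanity check on the figures above, the speedups can be converted into scaling efficiency (speedup divided by GPU count) and the "x faster" ratios recomputed from the training times. The helper names `scaling_efficiency` and `relative_speed` are hypothetical. Pairing Jia et al. with the AlexNet time and Akiba et al. with the ResNet-50 time is an assumption, inferred from which pairing reproduces the quoted ratios.

```python
def scaling_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling (1.0 means perfectly linear)."""
    return speedup / num_gpus


def relative_speed(baseline_minutes: float, minutes: float) -> float:
    """How many times faster `minutes` is than `baseline_minutes`."""
    return baseline_minutes / minutes


if __name__ == "__main__":
    # Numbers taken from the findings listed above.
    print(f"AlexNet efficiency:   {scaling_efficiency(410.2, 512):.1%}")
    print(f"ResNet-50 efficiency: {scaling_efficiency(434.1, 512):.1%}")
    # Assumed pairing: Jia et al. vs. AlexNet, Akiba et al. vs. ResNet-50.
    print(f"vs. Jia et al.:   {relative_speed(4.0, 1.5):.2f}x")
    print(f"vs. Akiba et al.: {relative_speed(15.0, 7.3):.2f}x")
```

The efficiencies come out at roughly 80% and 85% of linear scaling. The time ratios are about 2.67 and 2.05, which match the quoted 2.6x and 2.1x up to rounding. These ratios compare wall-clock time only and are not normalized for GPU count: the baselines used 1024 GPUs, while the proposed approach used 512.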

This review was generated by AI and checked by a human editor.