QUICK REVIEW

[Paper Review] Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes

Takuya Akiba, Shuji Suzuki|arXiv (Cornell University)|Nov 12, 2017

Advanced Neural Network Applications|6 references|281 citations

TL;DR

This paper trains ResNet-50 on ImageNet in 15 minutes using 1024 Tesla P100 GPUs and a minibatch size of 32k. It maintains 74.9% top-1 accuracy through RMSprop warm-up, a slow-start learning rate schedule, and batch normalization (BN) without moving averages.

ABSTRACT

We demonstrate that training ResNet-50 on ImageNet for 90 epochs can be achieved in 15 minutes with 1024 Tesla P100 GPUs. This was made possible by using a large minibatch size of 32k. To maintain accuracy with this large minibatch size, we employed several techniques such as RMSprop warm-up, batch normalization without moving averages, and a slow-start learning rate schedule. This paper also describes the details of the hardware and software of the system used to achieve the above performance.

Motivation & Objective

Demonstrate ultra-fast training of a standard CNN on a large-scale dataset using extensive parallelism.
Show that high accuracy can be maintained with very large minibatch sizes.
Detail the hardware/software stack and training procedures enabling scalable learning.
Identify and validate methods that stabilize optimization at scale.

Proposed method

Use a 32k minibatch with 1024 GPUs for 90 epochs on ImageNet.
Apply RMSprop warm-up: start training with RMSprop to ease early optimization, then transition smoothly to SGD.
Implement a slow-start learning rate schedule to mitigate initial optimization difficulty.
Replace the moving averages in batch normalization with statistics from the last minibatch, synchronized across workers via all-reduce.
Utilize Chainer and ChainerMN with NCCL and Open MPI, employing half-precision for communication to reduce overhead.
Provide detailed hardware (MN-1 cluster) and software configurations enabling reproducible large-scale training.
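
The RMSprop warm-up in the list above can be pictured as a single update rule whose step size blends an RMSprop term with a momentum-SGD term. The following is a minimal NumPy sketch under stated assumptions: the linear ramp in `rmsprop_mix`, the names `blended_step` and `run_demo`, the switch point `switch_epoch`, and all hyperparameter values are hypothetical and chosen for illustration. They are not taken from the paper, whose exact transition function is not reproduced in this review.

```python
import numpy as np


def rmsprop_mix(epoch, switch_epoch=5.0):
    """Weight of the RMSprop term: 1.0 at epoch 0, 0.0 from switch_epoch on.

    The linear ramp is an assumption made for illustration only.
    """
    return float(np.clip(1.0 - epoch / switch_epoch, 0.0, 1.0))


def blended_step(param, grad, state, epoch,
                 lr_sgd=0.1, lr_rms=0.001, momentum=0.9,
                 decay=0.99, eps=1e-8):
    """One momentum update whose step size mixes RMSprop and SGD scaling."""
    # Running average of squared gradients, as in RMSprop.
    state["sq"] = decay * state["sq"] + (1.0 - decay) * grad ** 2
    w = rmsprop_mix(epoch)
    # Per-element step size: adaptive early on, a plain SGD rate later.
    step_size = (1.0 - w) * lr_sgd + w * lr_rms / (np.sqrt(state["sq"]) + eps)
    state["vel"] = momentum * state["vel"] - step_size * grad
    return param + state["vel"]


def run_demo(steps=200, steps_per_epoch=20):
    """Minimize f(x) = 0.5 * ||x||^2, whose gradient is x itself."""
    x = np.array([3.0, -2.0])
    state = {"sq": np.zeros_like(x), "vel": np.zeros_like(x)}
    for step in range(steps):
        x = blended_step(x, x, state, epoch=step / steps_per_epoch)
    return x


if __name__ == "__main__":
    print(run_demo())
```

On this toy quadratic the iterate moves in small, gradient-normalized steps while the RMSprop weight is high, then converges under plain momentum SGD once the weight reaches zero. A real training run would apply the same rule per parameter tensor, with a schedule tuned for the model.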

Experimental results

Research questions

RQ1: Can ResNet-50 be trained on ImageNet with a minibatch size of 32k without sacrificing accuracy?
RQ2: What training procedure adjustments (e.g., optimizer warm-up, slow-start LR, BN statistics handling) are needed to stabilize extreme minibatch SGD?
RQ3: What are the hardware/software requirements and scalability characteristics when training with extremely large minibatches?

Key findings

90-epoch training of ResNet-50 on ImageNet with 32k minibatch and 1024 GPUs achieves 74.9% top-1 accuracy.
Total training time is 15 minutes (897.9 ± 3.3 seconds per run for 90 epochs on 1024 GPUs).
Scaling efficiency is 70% versus a single-GPU baseline and 80% versus a single-node (8 GPUs) baseline.
With careful algorithmic and system design, training remains viable at minibatch sizes larger than those used in prior work.
The method demonstrates stability and accuracy comparable to prior ResNet-50 results despite large minibatch sizes.
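
The timing and scaling figures above imply some simple arithmetic that can be checked directly. The sketch below assumes that "32k" means 32 × 1024 = 32,768 samples and uses the ILSVRC-2012 training-set size of 1,281,167 images, which the review does not state. The efficiency formula (per-GPU throughput relative to a baseline) is a common definition and is assumed here, and the implied single-GPU throughput is a back-calculation, not a number reported in the review.

```python
import math

# Values taken from the review.
NUM_GPUS = 1024
EPOCHS = 90
TRAIN_SECONDS = 897.9
EFFICIENCY_VS_SINGLE_GPU = 0.70

# Assumptions not stated in the review.
GLOBAL_BATCH = 32 * 1024           # "32k" read as 32,768 samples
IMAGENET_TRAIN_IMAGES = 1_281_167  # ILSVRC-2012 training set size


def scaling_efficiency(throughput_n, n_gpus, throughput_base, base_gpus):
    """Per-GPU throughput at scale divided by per-GPU baseline throughput."""
    return (throughput_n / n_gpus) / (throughput_base / base_gpus)


per_gpu_batch = GLOBAL_BATCH // NUM_GPUS
steps_per_epoch = math.ceil(IMAGENET_TRAIN_IMAGES / GLOBAL_BATCH)
images_per_second = IMAGENET_TRAIN_IMAGES * EPOCHS / TRAIN_SECONDS

# Back-calculated single-GPU throughput implied by the 70% figure.
implied_single_gpu = (images_per_second / NUM_GPUS) / EFFICIENCY_VS_SINGLE_GPU

if __name__ == "__main__":
    print("per-GPU minibatch:", per_gpu_batch)
    print("steps per epoch:", steps_per_epoch)
    print("training minutes:", round(TRAIN_SECONDS / 60, 2))
    print("images per second:", round(images_per_second))
    print("implied single-GPU images per second:", round(implied_single_gpu, 1))
```

Under these assumptions each GPU processes 32 samples per step, an epoch takes 40 steps, and the cluster sustains roughly 128,000 images per second over the whole run.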

This review was created by AI and reviewed by human editors.