QUICK REVIEW

[Paper Review] Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes

Takuya Akiba, Shuji Suzuki | arXiv (Cornell University) | Nov 12, 2017
Advanced Neural Network Applications | 6 references | 281 citations
TL;DR

This paper trains ResNet-50 on ImageNet for 90 epochs in 15 minutes using 1024 Tesla P100 GPUs and a minibatch size of 32k. It maintains ~74.9% top-1 accuracy through RMSprop warm-up, a slow-start learning rate schedule, and batch normalization (BN) without moving averages.

ABSTRACT

We demonstrate that training ResNet-50 on ImageNet for 90 epochs can be achieved in 15 minutes with 1024 Tesla P100 GPUs. This was made possible by using a large minibatch size of 32k. To maintain accuracy with this large minibatch size, we employed several techniques such as RMSprop warm-up, batch normalization without moving averages, and a slow-start learning rate schedule. This paper also describes the details of the hardware and software of the system used to achieve the above performance.

Motivation & Objective

  • Demonstrate ultra-fast training of a standard CNN on a large-scale dataset using extensive parallelism.
  • Show that high accuracy can be maintained with very large minibatch sizes.
  • Detail the hardware/software stack and training procedures enabling scalable learning.
  • Identify and validate methods that stabilize optimization at scale.

Proposed method

  • Use a 32k minibatch with 1024 GPUs for 90 epochs on ImageNet.
  • Apply RMSprop warm-up to ease early optimization and transition smoothly to SGD.
  • Implement a slow-start learning rate schedule to mitigate initial optimization difficulty.
  • Replace batch normalization moving averages with statistics from the last minibatch, and synchronize these statistics across workers via all-reduce.
  • Utilize Chainer and ChainerMN with NCCL and Open MPI, employing half-precision for communication to reduce overhead.
  • Provide detailed hardware (MN-1 cluster) and software configurations enabling reproducible large-scale training.
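
Two of the techniques listed above, RMSprop warm-up and the all-reduce of batch normalization statistics, can be sketched in plain Python. This is a minimal illustration, not the authors' Chainer/ChainerMN implementation. The linear blend schedule, the `warmup_steps` parameter, the hyperparameter defaults, and all function names are assumptions introduced here. Momentum is omitted for brevity, and the all-reduce is simulated by averaging a list of per-worker values.

```python
import math


def blend_weight(step, warmup_steps):
    """Share of the update taken from plain SGD (0 = pure RMSprop, 1 = pure SGD).

    A linear ramp is assumed here for illustration; the schedule used in
    the paper may differ.
    """
    return min(1.0, step / warmup_steps)


def blended_update(grad, sq_avg, step, warmup_steps, lr=0.1, decay=0.99, eps=1e-8):
    """Return (delta, new_sq_avg) for a single scalar parameter.

    Early steps follow the RMSprop direction (gradient scaled by a running
    root-mean-square); later steps follow the raw gradient, as in SGD.
    """
    sq_avg = decay * sq_avg + (1.0 - decay) * grad * grad
    rmsprop_dir = grad / (math.sqrt(sq_avg) + eps)
    w = blend_weight(step, warmup_steps)
    delta = -lr * (w * grad + (1.0 - w) * rmsprop_dir)
    return delta, sq_avg


def allreduce_mean(per_worker_values):
    """Stand-in for an all-reduce average across workers.

    Each entry plays the role of one worker's BN statistic (for example a
    channel mean) computed on its last local minibatch.
    """
    return sum(per_worker_values) / len(per_worker_values)


if __name__ == "__main__":
    sq_avg = 0.0
    for step in (0, 50, 100):
        delta, sq_avg = blended_update(2.0, sq_avg, step, warmup_steps=100)
        print(step, blend_weight(step, 100), delta)
    print(allreduce_mean([0.9, 1.0, 1.1]))
```

The blend keeps a single learning rate and shifts weight from the RMSprop direction to the raw gradient, so the optimizer changes gradually rather than switching abruptly at a fixed step.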

Experimental results

Research questions

  • RQ1: Can ResNet-50 be trained on ImageNet with a minibatch size of 32k without sacrificing accuracy?
  • RQ2: What training procedure adjustments (e.g., optimizer warm-up, slow-start LR, BN statistics handling) are needed to stabilize extreme minibatch SGD?
  • RQ3: What are the hardware/software requirements and scalability characteristics when training with extremely large minibatches?

Key findings

  • 90-epoch training of ResNet-50 on ImageNet with a 32k minibatch on 1024 GPUs achieves 74.9% top-1 accuracy.
  • Total training time is about 15 minutes (897.9 ± 3.3 seconds per 90-epoch run on 1024 GPUs).
  • Scaling efficiency is 70% versus a single-GPU baseline and 80% versus a single-node (8 GPUs) baseline.
  • As in prior large-minibatch work, the results show that extreme minibatch training is viable given careful algorithmic and system design.
  • The method demonstrates stability and accuracy comparable to prior ResNet-50 results despite large minibatch sizes.
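
The figures above can be cross-checked with some back-of-the-envelope arithmetic. The sketch below makes two assumptions that the review does not state: "32k" is read as 32 × 1024 = 32,768 images, and the ImageNet (ILSVRC-2012) training set is taken to contain 1,281,167 images. The throughput estimate further assumes that the whole 897.9 seconds is spent on training passes, so it is only a rough figure. All variable names are illustrative.

```python
import math

# Assumptions (not stated in the review itself):
IMAGENET_TRAIN_IMAGES = 1_281_167  # ILSVRC-2012 training set size
GLOBAL_BATCH = 32 * 1024           # "32k" read as 32,768 images

# Figures taken from the review:
NUM_GPUS = 1024
GPUS_PER_NODE = 8
EPOCHS = 90
TRAIN_SECONDS = 897.9

per_gpu_batch = GLOBAL_BATCH // NUM_GPUS
# The last, partial minibatch of each epoch is counted as one iteration.
iters_per_epoch = math.ceil(IMAGENET_TRAIN_IMAGES / GLOBAL_BATCH)
total_iters = iters_per_epoch * EPOCHS

minutes = TRAIN_SECONDS / 60
images_per_second = IMAGENET_TRAIN_IMAGES * EPOCHS / TRAIN_SECONDS

# Parallel efficiency = speedup / number of workers, so speedup = efficiency * workers.
speedup_vs_single_gpu = 0.70 * NUM_GPUS
speedup_vs_single_node = 0.80 * (NUM_GPUS // GPUS_PER_NODE)

print(f"per-GPU batch:        {per_gpu_batch}")
print(f"iterations per epoch: {iters_per_epoch}")
print(f"total iterations:     {total_iters}")
print(f"training time:        {minutes:.2f} min")
print(f"throughput:           {images_per_second:,.0f} images/s")
print(f"speedup vs 1 GPU:     {speedup_vs_single_gpu:.1f}x")
print(f"speedup vs 1 node:    {speedup_vs_single_node:.1f}x")
```

Under these assumptions each GPU processes 32 images per step, and a full run takes only a few thousand optimizer steps. That small step budget is part of why the early-training stabilization techniques matter at this batch size.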

This review was created by AI and reviewed by human editors.