[Paper Review] Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes
This paper trains ResNet-50 on ImageNet in 15 minutes using 1024 Tesla P100 GPUs with a minibatch of 32k, maintaining ~74.9% top-1 accuracy through RMSprop warm-up, slow-start learning rate, and BN without moving averages.
We demonstrate that training ResNet-50 on ImageNet for 90 epochs can be achieved in 15 minutes with 1024 Tesla P100 GPUs. This was made possible by using a large minibatch size of 32k. To maintain accuracy with this large minibatch size, we employed several techniques such as RMSprop warm-up, batch normalization without moving averages, and a slow-start learning rate schedule. This paper also describes the details of the hardware and software of the system used to achieve the above performance.
Motivation & Objective
- Demonstrate ultra-fast training of a standard CNN on a large-scale dataset using extensive parallelism.
- Show that high accuracy can be maintained with very large minibatch sizes.
- Detail the hardware/software stack and training procedures enabling scalable learning.
- Identify and validate methods that stabilize optimization at scale.
Proposed method
- Use a 32k minibatch with 1024 GPUs for 90 epochs on ImageNet.
- Apply RMSprop warm-up to ease early optimization and transition smoothly to SGD.
- Implement a slow-start learning rate schedule to mitigate initial optimization difficulty.
- Replace batch normalization moving averages with the statistics of the last minibatch, averaged across workers via all-reduce before validation.
- Utilize Chainer and ChainerMN with NCCL and Open MPI, employing half-precision for communication to reduce overhead.
- Provide detailed hardware (MN-1 cluster) and software configurations enabling reproducible large-scale training.
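The RMSprop warm-up idea in the list above can be illustrated with a small sketch. This is not the paper's update rule: the linear ramp, the absence of momentum, and every name and default here (`blend_weight`, `hybrid_step`, `warmup_steps`, `decay`) are assumptions chosen for illustration. It only shows the general shape of starting with an RMSprop-scaled step and handing over to plain SGD.

```python
import math


def blend_weight(step, warmup_steps):
    """Share of the update taken from SGD: 0.0 at step 0, 1.0 once warm-up ends.

    A linear ramp is an assumption; this review does not state the paper's
    transition function.
    """
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / warmup_steps))


def hybrid_step(params, grads, state, step, *, lr=0.1, warmup_steps=100,
                decay=0.99, eps=1e-8):
    """Return updated parameters from a mix of RMSprop and plain SGD steps.

    `state` holds one running average of squared gradients per parameter and
    is updated in place. Momentum is left out to keep the sketch short.
    """
    w = blend_weight(step, warmup_steps)
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # RMSprop statistic: exponential moving average of g^2.
        state[i] = decay * state[i] + (1.0 - decay) * g * g
        rms_update = g / (math.sqrt(state[i]) + eps)
        sgd_update = g
        # Early steps lean on RMSprop, later steps on SGD.
        new_params.append(p - lr * ((1.0 - w) * rms_update + w * sgd_update))
    return new_params


# Hypothetical usage on a two-parameter toy problem.
state = [0.0, 0.0]
params = hybrid_step([1.0, -2.0], [0.5, -0.25], state, step=0)
```

At step 0 the update is purely RMSprop-scaled, and from `warmup_steps` onward it reduces to `p - lr * g`, so the hand-over is continuous rather than an abrupt optimizer switch.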
Experimental results
Research questions
- RQ1: Can ResNet-50 be trained on ImageNet with a minibatch size of 32k without sacrificing accuracy?
- RQ2: What training procedure adjustments (e.g., optimizer warm-up, slow-start LR, BN statistics handling) are needed to stabilize extreme minibatch SGD?
- RQ3: What are the hardware/software requirements and scalability characteristics when training with extremely large minibatches?
Key findings
- 90-epoch training of ResNet-50 on ImageNet with 32k minibatch and 1024 GPUs achieves 74.9% top-1 accuracy.
- Training takes about 15 minutes: 897.9 ± 3.3 seconds per 90-epoch run on 1024 GPUs.
- Scaling efficiency is 70% versus a single-GPU baseline and 80% versus a single-node (8 GPUs) baseline.
- The comparison with prior work indicates that extremely large minibatch training is viable when the algorithm and the system are designed carefully.
- The method demonstrates stability and accuracy comparable to prior ResNet-50 results despite large minibatch sizes.
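The timing and scaling figures above can be checked with a few lines of arithmetic. The sketch below assumes the common definition of scaling efficiency (measured speedup divided by ideal linear speedup) and reads "32k" as 32,768. The baseline times it prints are implied by the rounded 70% and 80% figures, not numbers reported in this review, so treat them as rough estimates. The function names are hypothetical.

```python
def scaling_efficiency(base_time_s, base_gpus, time_s, gpus):
    """Measured speedup divided by the ideal (linear) speedup."""
    speedup = base_time_s / time_s
    ideal = gpus / base_gpus
    return speedup / ideal


def implied_baseline_time(efficiency, base_gpus, time_s, gpus):
    """Invert scaling_efficiency to estimate the baseline's training time."""
    return efficiency * (gpus / base_gpus) * time_s


TRAIN_TIME_S = 897.9          # mean time for 90 epochs, from the findings above
GPUS = 1024
MINIBATCH = 32 * 1024         # assumption: "32k" read as 32,768 samples

minutes = TRAIN_TIME_S / 60
per_gpu_batch = MINIBATCH // GPUS

# Estimates implied by the rounded efficiency figures.
single_gpu_s = implied_baseline_time(0.70, 1, TRAIN_TIME_S, GPUS)
single_node_s = implied_baseline_time(0.80, 8, TRAIN_TIME_S, GPUS)

print(f"training time: {minutes:.2f} min, {per_gpu_batch} samples per GPU")
print(f"implied single-GPU time: {single_gpu_s / 3600:.1f} h")
print(f"implied single-node (8 GPU) time: {single_node_s / 3600:.1f} h")
```

Under these assumptions the 1024-GPU run corresponds to a single-GPU job on the order of a week and a single-node job on the order of a day, which gives a sense of what the 70% and 80% efficiencies mean in wall-clock terms.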
This review was created by AI and reviewed by human editors.