QUICK REVIEW
[Paper Review] Multi-GPU Training of ConvNets
Omry Yadan, Keith Adams | arXiv (Cornell University) | Dec 20, 2013
Advanced Neural Network Applications | 11 references | 53 citations
TL;DR
This paper investigates multi-GPU training of convolutional neural networks (ConvNets) using data parallelism, model parallelism, and their hybrid combination. By combining both strategies on 4 GPUs, the authors achieve a 2.2x speedup over single-GPU training, significantly reducing training time for ImageNet classification while maintaining convergence stability.
ABSTRACT
In this work we evaluate different approaches to parallelize computation of convolutional neural networks across several GPUs.
Motivation & Objective
- Address the long training times of large-scale ConvNets by exploring parallelization strategies across multiple GPUs.
- Investigate the trade-offs between data parallelism and model parallelism in terms of communication overhead and hardware utilization.
- Determine the optimal configuration for accelerating training on multi-GPU setups without modifying the underlying optimization algorithm.
- Evaluate the feasibility and performance of hybrid parallelism (data + model) to maximize GPU utilization and minimize training time.
Proposed method
- Implement data parallelism by splitting the mini-batch (size 256) across multiple GPUs, with each GPU computing gradients on a subset of samples.
- Implement model parallelism by partitioning the network architecture across GPUs, such as splitting filters or layers between devices, as in Krizhevsky et al. [1].
- Combine data and model parallelism by distributing both data samples and network components across 4 GPUs to balance load and reduce communication bottlenecks.
- Use synchronous mini-batch stochastic gradient descent with standard back-propagation, isolating the impact of parallelism from optimization changes.
- Measure training time and test error over 100 epochs on the ImageNet 2012 dataset using NVIDIA TITAN GPUs with 6GB RAM.
- Communicate gradients and model parameters between GPUs via PCIe, simulating distributed communication overhead in a single-server setup.
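The two basic schemes in the list above can be illustrated with a small, CPU-only sketch. It is not the authors' implementation: the toy 1-D regression model, the dense layer, and every function name here (`shard_batch`, `local_gradient`, `data_parallel_step`, `dense_forward`, `model_parallel_forward`) are hypothetical, and the "GPUs" are simulated by plain Python loops. The first half shows that averaging per-shard gradients of an evenly split mini-batch reproduces the single-device synchronous SGD step; the second half shows a layer's output units being partitioned across devices, as a simplified stand-in for splitting filters.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y) for a toy 1-D regression


def shard_batch(batch: List[Sample], num_workers: int) -> List[List[Sample]]:
    """Split a mini-batch into equal contiguous shards, one per simulated GPU."""
    if len(batch) % num_workers != 0:
        raise ValueError("batch size must be divisible by num_workers")
    size = len(batch) // num_workers
    return [batch[k * size:(k + 1) * size] for k in range(num_workers)]


def local_gradient(w: float, shard: List[Sample]) -> float:
    """Mean gradient of 0.5 * (w * x - y) ** 2 with respect to w on one shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w: float, batch: List[Sample],
                       num_workers: int, lr: float) -> float:
    """One synchronous SGD step with the mini-batch split across workers."""
    shards = shard_batch(batch, num_workers)
    # On real hardware each shard's gradient would be computed on its own GPU.
    grads = [local_gradient(w, shard) for shard in shards]
    # Averaging stands in for the gradient exchange between devices.
    avg_grad = sum(grads) / num_workers
    return w - lr * avg_grad


def dense_forward(weights: List[List[float]], x: List[float]) -> List[float]:
    """Plain matrix-vector product: one output per row of weights."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in weights]


def model_parallel_forward(weights: List[List[float]], x: List[float],
                           num_workers: int) -> List[float]:
    """Partition the layer's output units across workers and concatenate."""
    if len(weights) % num_workers != 0:
        raise ValueError("output units must be divisible by num_workers")
    rows = len(weights) // num_workers
    outputs: List[float] = []
    for k in range(num_workers):
        # Each worker holds only its slice of the weights but sees the full input.
        outputs.extend(dense_forward(weights[k * rows:(k + 1) * rows], x))
    return outputs


# Toy mini-batch of 256 samples drawn from y = 3x.
batch = [(i / 256.0, 3.0 * i / 256.0) for i in range(256)]
w_single = data_parallel_step(0.0, batch, num_workers=1, lr=0.1)
w_multi = data_parallel_step(0.0, batch, num_workers=4, lr=0.1)

weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
x = [0.5, -1.0]
full = dense_forward(weights, x)
split = model_parallel_forward(weights, x, num_workers=2)
```

In this sketch `w_single` and `w_multi` agree up to floating-point rounding, and `split` equals `full`, which is why neither scheme requires changing the optimization algorithm. A hybrid scheme would apply both at once, for example by sharding the batch across two groups of devices and splitting the layer inside each group.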
Experimental results
Research questions
- RQ1: How does data parallelism compare to model parallelism in terms of training speed and communication cost on multi-GPU setups?
- RQ2: Can hybrid data and model parallelism achieve better performance than either strategy alone?
- RQ3: What is the maximum speedup achievable when training a large ConvNet on 4 GPUs using different parallelization schemes?
- RQ4: How does mini-batch size distribution affect GPU utilization and convergence in multi-GPU training?
- RQ5: What are the practical limitations of single-GPU memory when training large models, and how can multi-GPU strategies mitigate them?
Key findings
- The hybrid approach combining data and model parallelism on 4 GPUs achieved a 2.2x speedup compared to single-GPU training, reducing training time from 10.5 days to 4.8 days.
- Data parallelism on 2 GPUs yielded a 1.5x speedup, while model parallelism on 2 GPUs achieved a 1.6x speedup, indicating model parallelism is slightly more efficient for this setup.
- Pure data parallelism on 4 GPUs took 7.2 days, a speedup of only about 1.46x over the 10.5-day single-GPU baseline and no better than data parallelism on 2 GPUs, suggesting diminishing returns due to increased communication overhead.
- The hybrid configuration on 4 GPUs achieved the fastest convergence, with test error decreasing most rapidly over time, as shown in Figure 1.
- Mini-batch sizes below 64 samples underutilized GPU cores, while sizes above 256 were constrained by single-GPU memory limits (6GB RAM), making 256 the optimal batch size.
- Communication overhead significantly impacts performance, especially in data parallelism, where gradients and parameters must be synchronized across all GPUs on every update step.
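The speedup figures in the findings above follow directly from the reported training times. The short sketch below recomputes them and adds parallel efficiency (speedup divided by GPU count), a metric this review does not report but which makes the diminishing returns easier to see. The helper names `speedup` and `parallel_efficiency` are hypothetical; only the day counts come from the review.

```python
BASELINE_DAYS = 10.5  # single-GPU training time reported in the review

# Configuration name -> (number of GPUs, training time in days), as reported.
reported_runs = {
    "4 GPUs, data parallelism": (4, 7.2),
    "4 GPUs, data + model parallelism": (4, 4.8),
}


def speedup(baseline_days: float, parallel_days: float) -> float:
    """Speedup is the baseline time divided by the parallel time."""
    return baseline_days / parallel_days


def parallel_efficiency(baseline_days: float, parallel_days: float,
                        num_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 would be perfect)."""
    return speedup(baseline_days, parallel_days) / num_gpus


summary = {
    name: (speedup(BASELINE_DAYS, days),
           parallel_efficiency(BASELINE_DAYS, days, gpus))
    for name, (gpus, days) in reported_runs.items()
}

for name, (s, e) in summary.items():
    print(f"{name}: speedup {s:.2f}x, efficiency {e:.0%}")
```

Under these numbers the hybrid run works out to roughly 2.19x, or about 55% of ideal 4-GPU scaling, while pure data parallelism on 4 GPUs reaches roughly 1.46x, or about 36%.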
This review was created by AI and reviewed by human editors.