QUICK REVIEW

[Paper Review] Parle: parallelizing stochastic gradient descent

Pratik Chaudhari, Carlo Baldassi | arXiv (Cornell University) | Jul 3, 2017
Stochastic Gradient Optimization Techniques | 35 references | 18 citations
TL;DR

Parle is a parallel training algorithm for deep neural networks that converges 2–4× faster than data-parallel SGD while reaching nearly state-of-the-art generalization error on CIFAR-10 and CIFAR-100. It trains multiple model replicas with entropy regularization and couples them through a proximal term with infrequent communication, which suits both multi-GPU and distributed systems and introduces no additional hyperparameters.

ABSTRACT

We propose a new algorithm called Parle for parallel training of deep networks that converges 2-4x faster than a data-parallel implementation of SGD, while achieving significantly improved error rates that are nearly state-of-the-art on several benchmarks including CIFAR-10 and CIFAR-100, without introducing any additional hyper-parameters. We exploit the phenomenon of flat minima that has been shown to lead to improved generalization error for deep networks. Parle requires very infrequent communication with the parameter server and instead performs more computation on each client, which makes it well-suited to both single-machine, multi-GPU settings and distributed implementations.

Motivation & Objective

  • Address the trade-off between communication cost and generalization performance in distributed SGD training of deep networks.
  • Overcome the limitations of large-batch SGD (poor generalization) and small-batch SGD (high communication overhead).
  • Enable efficient, scalable parallel training in both single-machine multi-GPU and distributed environments with minimal hyperparameter tuning.
  • Leverage the concept of flat minima to improve generalization while reducing communication frequency.
  • Develop a unified framework that combines entropy regularization and elastic averaging for robust, scalable optimization.

Proposed method

  • Train multiple replicas of the same model in parallel, each performing multiple gradient steps on a subset of data.
  • Replace the loss with the 'local entropy' $ f_{\gamma}(x) = -\log\left(G_{\gamma} * e^{-f(x)}\right) $, where $ G_{\gamma} $ is a Gaussian kernel of variance $ \gamma $, to smooth the non-convex loss landscape and encourage flat minima.
  • Couple replicas via a proximal term $ \frac{1}{2\rho} \|x^a - x\|^2 $ that enforces consensus toward a shared reference parameter $ x $, reducing communication frequency.
  • Gradually decrease $ \gamma \to 0 $ and $ \rho \to 0 $ via 'scoping' to collapse replicas to a single optimal solution.
  • Implement the algorithm in a parameter server architecture with infrequent synchronization, making it suitable for heterogeneous systems.
  • Maintain identical hyperparameters across all experiments, avoiding additional tuning beyond standard SGD settings.
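
The steps above can be sketched on a one-dimensional toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: each replica descends $ f_{\gamma}(x^a) + \frac{1}{2\rho} \|x^a - x\|^2 $, the local-entropy gradient is computed by exact quadrature on a grid (feasible only in one dimension; a deep network would need a stochastic estimate instead), the reference is set to the replica mean once per round, and $ \gamma $ and $ \rho $ are shrunk geometrically down to a floor. The names `toy_loss`, `parle_sketch`, `GRID`, the floor, and all step sizes are hypothetical choices made for this example.

```python
import math

# Quadrature grid for the 1-D toy problem (an assumption of this sketch).
GRID = [-6.0 + 0.01 * i for i in range(1201)]


def local_entropy_grad(f, x, gamma):
    """Gradient of f_gamma(x) = -log((G_gamma * exp(-f))(x)) by quadrature.

    Uses grad f_gamma(x) = (x - E[y]) / gamma, where y has density
    proportional to exp(-f(y) - (x - y)**2 / (2 * gamma)).
    """
    log_w = [-f(y) - (x - y) ** 2 / (2.0 * gamma) for y in GRID]
    top = max(log_w)  # subtract the max before exponentiating, for stability
    w = [math.exp(v - top) for v in log_w]
    mean_y = sum(wi * y for wi, y in zip(w, GRID)) / sum(w)
    return (x - mean_y) / gamma


def toy_loss(y):
    # Hypothetical non-convex loss: a quadratic bowl with ripples.
    return 0.5 * y * y + 0.3 * math.sin(8.0 * y)


def parle_sketch(f, init, rounds=40, local_steps=5, lr=0.05,
                 gamma=1.0, rho=1.0, decay=0.93, floor=0.1):
    replicas = list(init)                 # x^a, one scalar per replica
    ref = sum(replicas) / len(replicas)   # shared reference x
    for _ in range(rounds):
        for a in range(len(replicas)):
            for _ in range(local_steps):  # local work, no communication
                grad = local_entropy_grad(f, replicas[a], gamma)
                grad += (replicas[a] - ref) / rho  # gradient of the proximal term
                replicas[a] -= lr * grad
        ref = sum(replicas) / len(replicas)  # one synchronization per round
        gamma = max(gamma * decay, floor)    # scoping: shrink gamma and rho,
        rho = max(rho * decay, floor)        # floored so the fixed step stays stable
    return ref, replicas


ref, replicas = parle_sketch(toy_loss, init=[-2.5, 1.0, 3.0])
spread = max(abs(xa - ref) for xa in replicas)
print(f"reference = {ref:.3f}, replica spread = {spread:.2e}")
```

The floor on $ \gamma $ and $ \rho $ exists only because this sketch uses a fixed step size: the proximal term has curvature $ 1/\rho $, so letting $ \rho $ shrink all the way to zero would require a shrinking step as well. As $ \rho $ decreases, the pull toward the reference grows and the replicas collapse onto it, which is the behaviour the scoping bullet describes.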

Experimental results

Research questions

  • RQ1: Can we achieve faster convergence and better generalization in deep learning by reducing communication frequency in parallel SGD?
  • RQ2: How does coupling multiple model replicas via a proximal term and entropy regularization improve generalization without increasing hyperparameter complexity?
  • RQ3: To what extent can a model trained on partitioned data via Parle match or exceed the performance of SGD trained on the full dataset?
  • RQ4: Does the use of local entropy and scoping enable stable convergence to flat minima in non-convex deep learning problems?
  • RQ5: Can Parle scale efficiently across heterogeneous systems with varying compute and communication capabilities?

Key findings

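  • The reported speedup is 2–4× faster convergence than data-parallel SGD. On CIFAR-10 with the All-CNN architecture, however, the quoted training times are 75 minutes for Parle against 37 minutes for baseline SGD, so on this model the gain is in error rate rather than wall-clock time.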
  • Parle attains a validation error of 5.18% on CIFAR-10 with full data, outperforming baseline SGD (6.15%) and Elastic-SGD (5.76%) under the same conditions.
  • With three replicas each trained on only 50% of the data, Parle achieves 5.89% error, well below SGD on the same subset (7.86%), which indicates robustness to data partitioning.
  • With six replicas each trained on 25% of the data, Parle achieves 6.08% error, while SGD on the same subset degrades to 10.96%, so coupling the replicas largely compensates for each one seeing less data.
  • Parle reaches nearly state-of-the-art performance without introducing any new hyperparameters, unlike other methods such as Elastic-SGD or Entropy-SGD.
  • All experiments used the same settings, including weight decay of $10^{-3}$, dropout of 0.5, and data augmentation, which suggests the algorithm is not sensitive to hyperparameter choices.
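
To put the error rates above on a common scale, the short calculation below converts each Parle/SGD pair into a relative error reduction. It uses only the percentages quoted in this list; the dictionary keys and the helper name `relative_reduction` are labels introduced for this example.

```python
# Validation errors (%) quoted in the key findings: (Parle, SGD).
results = {
    "full data": (5.18, 6.15),
    "50% data, 3 replicas": (5.89, 7.86),
    "25% data, 6 replicas": (6.08, 10.96),
}


def relative_reduction(parle_err, sgd_err):
    """Fraction by which Parle's error is lower than SGD's."""
    return (sgd_err - parle_err) / sgd_err


reductions = {name: relative_reduction(*errs) for name, errs in results.items()}
for name, r in reductions.items():
    print(f"{name}: {100 * r:.1f}% lower error than SGD")
```

The reductions come out at roughly 16%, 25%, and 45%, so the relative advantage over SGD widens as each replica sees less of the data.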

This review was created by AI and reviewed by human editors.