QUICK REVIEW

[Paper Review] Don't Use Large Mini-Batches, Use Local SGD

Tao Lin, Sebastian U. Stich | arXiv (Cornell University) | Aug 22, 2018
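Advanced Neural Network Applications | References: 87 | Cited by: 162
One-Sentence Summary

This paper shows that large-batch SGD generalizes poorly and proposes post-local SGD and hierarchical local SGD to improve generalization and efficiency, outperforming large-batch baselines on standard benchmarks.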

ABSTRACT

Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in the mini-batch sizes have led to key efficiency and scalability gains in recent years. However, progress faces a major roadblock, as models trained with large batches often do not generalize well, i.e. they do not show good accuracy on new data. As a remedy, we propose a post-local SGD and show that it significantly improves the generalization performance compared to large-batch training on standard benchmarks while enjoying the same efficiency (time-to-accuracy) and scalability. We further provide an extensive study of the communication efficiency vs. performance trade-offs associated with a host of local SGD variants.

Motivation and Objectives

  • Highlight the generalization issues associated with very large mini-batch SGD in distributed training.
  • Systematically study the trade-offs of local SGD across workers, local steps, and mini-batch sizes.
  • Propose post-local SGD to recover generalization while maintaining efficiency.
  • Propose hierarchical local SGD to optimize system resource use in heterogeneous hardware environments.

Proposed Method

  • Define local SGD, in which each worker performs H local SGD updates with local mini-batch size B_loc before the workers average their models (Eq. 2).
  • Compare local SGD to mini-batch SGD under scenarios of communication efficiency and generalization performance.
  • Introduce post-local SGD, which runs standard mini-batch SGD for an initial phase of t′ steps and then switches to local SGD, achieving large effective batch sizes with better generalization.
  • Propose hierarchical local SGD to apply local updates at multiple levels of a system hierarchy to optimize computation-communication trade-offs.
  • Relate local updates to stochastic noise injection and discuss implications for training dynamics and generalization.
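
The local SGD and post-local SGD schedules described above can be sketched on a toy problem. This is a minimal illustration and not the authors' implementation: it simulates the workers sequentially in one process and fits a scalar least-squares model. The names `make_shards`, `grad`, `local_sgd`, and `t_switch` (standing in for t′), the data, and the hyperparameters are all assumptions chosen for the example.

```python
import random

def make_shards(K, n, seed=1):
    """Hypothetical data: K worker shards of n points from y = 3x + small noise."""
    rng = random.Random(seed)
    shards = []
    for _ in range(K):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        shards.append([(x, 3.0 * x + rng.gauss(0.0, 0.01)) for x in xs])
    return shards

def grad(w, batch):
    """Gradient of 0.5 * (w * x - y)^2, averaged over the mini-batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def local_sgd(shards, steps, H, b_loc, lr, t_switch=0, seed=0):
    """Toy local SGD over len(shards) simulated workers.

    During the first t_switch steps the workers average after every step,
    which matches mini-batch SGD with batch size K * b_loc. Afterwards they
    average only every H local steps (the post-local schedule).
    t_switch=0 gives plain local SGD.
    Returns (final averaged weight, number of averaging rounds).
    """
    rng = random.Random(seed)
    K = len(shards)
    weights = [0.0] * K          # one model copy per worker
    rounds = 0
    since_sync = 0
    for t in range(steps):
        for k in range(K):       # each worker takes one local step
            batch = rng.sample(shards[k], b_loc)
            weights[k] -= lr * grad(weights[k], batch)
        since_sync += 1
        period = 1 if t < t_switch else H
        if since_sync >= period:  # averaging (communication) round
            avg = sum(weights) / K
            weights = [avg] * K
            rounds += 1
            since_sync = 0
    if since_sync > 0:           # trailing partial block: one last averaging
        rounds += 1
    return sum(weights) / K, rounds

shards = make_shards(K=4, n=64)
w_local, r_local = local_sgd(shards, steps=200, H=8, b_loc=4, lr=0.3)
w_post, r_post = local_sgd(shards, steps=200, H=8, b_loc=4, lr=0.3, t_switch=100)
print(w_local, r_local)
print(w_post, r_post)
```

In this toy setting both schedules recover a weight close to 3 while plain local SGD uses far fewer averaging rounds. The sketch says nothing about generalization, which the paper studies on deep networks.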

Experimental Results

Research Questions

  • RQ1: Can local SGD match or exceed mini-batch SGD in time-to-accuracy under communication constraints?
  • RQ2: Does local SGD improve generalization compared to large-batch SGD at the same effective batch size?
  • RQ3: Does post-local SGD close the generalization gap associated with large batches without sacrificing efficiency?
  • RQ4: How can hierarchical local SGD optimize resource use in heterogeneous computing environments?

Key Findings

  • Local SGD can serve as a communication-efficient alternative to mini-batch SGD with favorable generalization on CIFAR-10/100 and ImageNet.
  • Post-local SGD closes the generalization gap of large-batch training and can achieve better generalization than both small and large batch baselines.
  • Post-local SGD achieves at least a 1.3× speedup over the full training run with improved generalization on CIFAR, and shows strong performance on ImageNet with large global batch sizes.
  • Local SGD scales better in time-to-accuracy than mini-batch SGD as the number of workers increases, due to fewer communication rounds.
  • Post-local SGD can be combined with sign-based compression to further improve communication efficiency without sacrificing accuracy.
  • Post-local SGD tends to reach flatter minima than large-batch SGD, contributing to improved generalization.
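
The "fewer communication rounds" point can be made concrete with a small count. The sketch below is only an illustration under stated assumptions: the function name `sync_rounds`, the parameter `t_switch` (standing in for t′), and the step counts are hypothetical and not taken from the paper. It assumes one averaging round per step during the initial mini-batch phase, one round per H local steps afterwards, and one final averaging for any trailing partial block.

```python
import math

def sync_rounds(total_steps, H, t_switch=0):
    """Count averaging rounds for a (post-)local SGD schedule.

    The first t_switch steps synchronize every step; the remaining steps
    synchronize once per block of H local steps, rounding up so that a
    trailing partial block still ends with one averaging.
    """
    if H < 1 or not 0 <= t_switch <= total_steps:
        raise ValueError("need H >= 1 and 0 <= t_switch <= total_steps")
    return t_switch + math.ceil((total_steps - t_switch) / H)

# Hypothetical schedule of 200 steps:
print(sync_rounds(200, H=1))                # mini-batch SGD: every step
print(sync_rounds(200, H=8))                # local SGD from the start
print(sync_rounds(200, H=8, t_switch=100))  # post-local SGD
```

How much wall-clock time this saves depends on the cost of each round relative to a local step, which this count does not model.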

This review was generated by AI and reviewed by a human editor.