QUICK REVIEW

[Paper Review] Don't Use Large Mini-Batches, Use Local SGD

Tao Lin, Sebastian U. Stich|arXiv (Cornell University)|Aug 22, 2018

Advanced Neural Network Applications|References: 87|Cited by: 162

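ONE-SENTENCE SUMMARY

This paper shows that large-batch SGD generalizes poorly and proposes post-local SGD and hierarchical local SGD to improve generalization and efficiency, outperforming large-batch baselines on standard benchmarks.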

ABSTRACT

Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in the mini-batch sizes have led to key efficiency and scalability gains in recent years. However, progress faces a major roadblock, as models trained with large batches often do not generalize well, i.e. they do not show good accuracy on new data. As a remedy, we propose a post-local SGD and show that it significantly improves the generalization performance compared to large-batch training on standard benchmarks while enjoying the same efficiency (time-to-accuracy) and scalability. We further provide an extensive study of the communication efficiency vs. performance trade-offs associated with a host of local SGD variants.

MOTIVATION AND OBJECTIVES

Examine the generalization issues that arise with very large mini-batch SGD in distributed training.
Systematically study the trade-offs of local SGD across workers, local steps, and mini-batch sizes.
Propose post-local SGD to recover generalization while maintaining efficiency.
Propose hierarchical local SGD to optimize system resource use in heterogeneous hardware environments.

PROPOSED METHODS

Define local SGD, where each worker performs H local SGD updates with a local mini-batch of size B_loc before the workers average their models (Eq. 2).
Compare local SGD with mini-batch SGD in terms of communication efficiency and generalization performance.
Introduce post-local SGD, which trains with standard mini-batch SGD for an initial phase of t′ steps and then switches to local SGD, keeping a large effective batch size while generalizing better.
Propose hierarchical local SGD to apply local updates at multiple levels of a system hierarchy to optimize computation-communication trade-offs.
Relate local updates to stochastic noise injection and discuss implications for training dynamics and generalization.
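
The local SGD and post-local SGD schedules described above can be sketched as a single-process simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy least-squares objective, the helper names (`make_data`, `grad`, `local_sgd`) and the parameter name `t_switch` (standing in for t′) are all hypothetical, and the K workers are simulated sequentially rather than run in parallel.

```python
import random


def make_data(K, n, seed=1):
    """Toy data: each of K workers holds n noise-free pairs (x, 3x)."""
    rng = random.Random(seed)
    return [[(x, 3.0 * x) for x in (rng.uniform(-1, 1) for _ in range(n))]
            for _ in range(K)]


def grad(w, x, y):
    """Gradient of 0.5 * (w*x - y)**2 with respect to the scalar model w."""
    return (w * x - y) * x


def local_sgd(data_per_worker, steps, H, b_loc, lr, t_switch=0, seed=0):
    """Simulate local SGD; returns (final averaged model, averaging rounds).

    steps    : SGD updates performed by each worker
    H        : local updates between two model averagings
    b_loc    : local mini-batch size (B_loc)
    t_switch : assumed stand-in for t'; during the first t_switch updates
               the workers average after every update (mini-batch SGD)
    """
    rng = random.Random(seed)
    K = len(data_per_worker)
    models = [0.0] * K          # every worker starts from the same model
    rounds = 0                  # number of averaging (communication) rounds
    since_sync = 0              # local updates since the last averaging
    for t in range(steps):
        for k in range(K):      # one local update on every worker
            batch = rng.sample(data_per_worker[k], b_loc)
            g = sum(grad(models[k], x, y) for x, y in batch) / b_loc
            models[k] -= lr * g
        since_sync += 1
        period = 1 if t < t_switch else H
        if since_sync >= period:
            avg = sum(models) / K
            models = [avg] * K  # all workers continue from the average
            rounds += 1
            since_sync = 0
    return sum(models) / K, rounds


if __name__ == "__main__":
    data = make_data(K=4, n=64)
    for name, kwargs in [("mini-batch SGD", dict(H=1)),
                         ("local SGD", dict(H=8)),
                         ("post-local SGD", dict(H=8, t_switch=200))]:
        w, rounds = local_sgd(data, steps=400, b_loc=4, lr=0.2, **kwargs)
        print(f"{name}: w = {w:.4f}, averaging rounds = {rounds}")
    # averaging rounds: 400, 50 and 225 respectively
```

With H = 1 the loop reduces to synchronous mini-batch SGD with an effective batch of K · B_loc samples per update. The toy problem only illustrates the schedules and their communication counts; it says nothing about the generalization effects the paper reports.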

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can local SGD match or exceed mini-batch SGD in time-to-accuracy under communication constraints?
RQ2: Does local SGD improve generalization compared to large-batch SGD at the same effective batch size?
RQ3: Does post-local SGD close the generalization gap associated with large batches without sacrificing efficiency?
RQ4: How can hierarchical local SGD optimize resource use in heterogeneous computing environments?

KEY FINDINGS

Local SGD can serve as a communication-efficient alternative to mini-batch SGD with favorable generalization on CIFAR-10/100 and ImageNet.
Post-local SGD closes the generalization gap of large-batch training and can achieve better generalization than both small and large batch baselines.
Post-local SGD delivers at least a 1.3× speedup over the full training run while improving generalization on CIFAR, and performs strongly on ImageNet with large global batch sizes.
Local SGD scales better in time-to-accuracy than mini-batch SGD as the number of workers increases, due to fewer communication rounds.
Post-local SGD can be combined with sign-based compression to further improve communication efficiency without sacrificing accuracy.
Post-local SGD tends to reach flatter minima than large-batch SGD, contributing to improved generalization.
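
To make the sign-based compression finding more concrete, here is a minimal sketch of one generic scaled-sign compressor. The summary does not say which sign-based scheme the paper uses or what exactly is compressed (gradients or model differences), so the function names, the choice of the mean absolute value as the shared scale, and the 32-bit float size are all assumptions made for illustration.

```python
def sign_compress(vec):
    """Keep one sign per coordinate plus a single shared scale.

    The scale is the mean absolute value of the vector (an assumed,
    commonly used choice), so the payload is len(vec) signs and one float.
    """
    scale = sum(abs(v) for v in vec) / len(vec)
    signs = [1 if v >= 0 else -1 for v in vec]
    return scale, signs


def decompress(scale, signs):
    """Rebuild an approximate vector from the shared scale and the signs."""
    return [scale * s for s in signs]


def payload_bits(dim, float_bits=32):
    """Bits per message: (dense floats, one bit per sign plus one scale)."""
    return dim * float_bits, dim + float_bits


if __name__ == "__main__":
    update = [0.5, -1.5, 1.0]              # hypothetical update vector
    scale, signs = sign_compress(update)
    print(decompress(scale, signs))        # [1.0, -1.0, 1.0]
    print(payload_bits(1000))              # (32000, 1032)
```

Under these assumptions each message shrinks by roughly the float width (about 32×), and this saving stacks with the reduction in communication rounds that local updates already provide.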


This review was generated by AI and reviewed by human editors.