QUICK REVIEW

[Paper Review] SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum

Jianyu Wang, Vinayak Tantia|arXiv (Cornell University)|Oct 1, 2019

Stochastic Gradient Optimization Techniques | References: 42 | Cited by: 68

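One-Sentence Summary

SlowMo is a general slow-momentum framework that sits on top of a base distributed optimizer (e.g., Local SGD, SGP) and improves optimization and generalization without adding significant communication. It comes with convergence guarantees for smooth non-convex objectives and matches or exceeds the accuracy of the base methods while preserving their efficiency.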

ABSTRACT

Distributed optimization is essential for training large models on large datasets. Multiple approaches have been proposed to reduce the communication overhead in distributed training, such as synchronizing only after performing multiple local SGD steps, and decentralized methods (e.g., using gossip algorithms) to decouple communications among workers. Although these methods run faster than AllReduce-based methods, which use blocking communication before every update, the resulting models may be less accurate after the same number of updates. Inspired by the BMUF method of Chen & Huo (2016), we propose a slow momentum (SlowMo) framework, where workers periodically synchronize and perform a momentum update, after multiple iterations of a base optimization algorithm. Experiments on image classification and machine translation tasks demonstrate that SlowMo consistently yields improvements in optimization and generalization performance relative to the base optimizer, even when the additional overhead is amortized over many updates so that the SlowMo runtime is on par with that of the base optimizer. We provide theoretical convergence guarantees showing that SlowMo converges to a stationary point of smooth non-convex losses. Since BMUF can be expressed through the SlowMo framework, our results also correspond to the first theoretical convergence guarantees for BMUF.

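Motivation and Objectives

Reduce communication overhead in distributed optimization while maintaining or improving model accuracy.
Provide a unified SlowMo framework that strengthens base optimizers (SGD, SGP, etc.) through periodic synchronization and a momentum update.
Give theoretical convergence guarantees for SlowMo on smooth non-convex objectives.
Demonstrate empirical gains on image classification and neural machine translation tasks.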

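Proposed Method

Between communication rounds, each worker runs a base optimizer for τ local steps.
After τ steps, the workers average their parameters via AllReduce to form x_{t,τ}.
Apply the slow momentum update: u_{t+1} = β u_t + (1/γ_t)(x_{t,0} - x_{t,τ}).
Outer update: x_{t+1,0} = x_{t,0} - α γ_t u_{t+1}, which carries the momentum into the next round.
With suitable parameter choices, the SlowMo update recovers BMUF, Local SGD, and Lookahead as special cases.
Theory shows convergence to a stationary point for smooth non-convex losses under standard assumptions, at a rate of O(1/√(m T τ)).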
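
The steps above can be sketched in a single process. This is a minimal illustration under stated assumptions, not the authors' implementation: the m workers are simulated in a loop, an in-process mean stands in for AllReduce, the base optimizer is plain SGD, and γ_t is assumed to be the base optimizer's step size, held constant. The names slowmo_outer_step, run_slowmo, noisy_quadratic_grad, and TARGET are hypothetical and chosen only for this example.

```python
import random


def slowmo_outer_step(x_start, x_avg, u, alpha, beta, gamma):
    """One slow-momentum update, following the two update rules above.

    x_start: x_{t,0}, parameters at the start of the outer iteration
    x_avg:   x_{t,tau}, worker-averaged parameters after tau local steps
    u:       slow momentum buffer u_t
    """
    # u_{t+1} = beta * u_t + (x_{t,0} - x_{t,tau}) / gamma
    u_next = [beta * ui + (xs - xa) / gamma
              for ui, xs, xa in zip(u, x_start, x_avg)]
    # x_{t+1,0} = x_{t,0} - alpha * gamma * u_{t+1}
    x_next = [xs - alpha * gamma * un for xs, un in zip(x_start, u_next)]
    return x_next, u_next


def run_slowmo(grad_fn, x0, m=4, tau=8, outer_iters=50,
               gamma=0.05, alpha=1.0, beta=0.5, seed=0):
    """Simulate m workers running tau local SGD steps per outer iteration."""
    rng = random.Random(seed)
    x = list(x0)
    u = [0.0] * len(x)
    for _ in range(outer_iters):
        # Every worker starts the round from the same point x_{t,0}.
        workers = [list(x) for _ in range(m)]
        for w in workers:
            for _ in range(tau):
                g = grad_fn(w, rng)
                for j in range(len(w)):
                    w[j] -= gamma * g[j]  # assumed base optimizer: plain SGD
        # In-process mean as a stand-in for an AllReduce average.
        x_avg = [sum(w[j] for w in workers) / m for j in range(len(x))]
        x, u = slowmo_outer_step(x, x_avg, u, alpha, beta, gamma)
    return x


# Hypothetical toy problem: f(x) = 0.5 * ||x - TARGET||^2 with noisy gradients.
TARGET = [1.0, -2.0, 3.0]


def noisy_quadratic_grad(x, rng, noise=0.1):
    return [xi - ti + noise * rng.gauss(0.0, 1.0) for xi, ti in zip(x, TARGET)]


if __name__ == "__main__":
    print(run_slowmo(noisy_quadratic_grad, [0.0, 0.0, 0.0]))
```

Setting alpha=1 and beta=0 makes the outer step return x_{t,τ} itself, so in this sketch the update collapses to plain periodic averaging on top of the base optimizer.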

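Experimental Results

Research Questions

RQ1: Does SlowMo consistently improve optimization and generalization across different base distributed optimizers (e.g., SGP, Local SGD, BMUF) while remaining communication-efficient?
RQ2: What convergence guarantees does SlowMo offer for smooth non-convex objectives, and how do the parameters (τ, α, β) affect performance?
RQ3: How does SlowMo perform on large-scale vision and language tasks compared with baselines such as AR-SGD, SGP, and OSGP?
RQ4: How does removing exact averaging in a SlowMo variant (e.g., SGP-SlowMo-noaverage) affect performance and communication?
RQ5: How does the choice of τ affect the speed-accuracy trade-off and model drift across tasks?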

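Key Findings

When combined with base optimizers such as SGP, OSGP, and Local SGD, SlowMo consistently improves both training loss and validation accuracy/BLEU on CIFAR-10, ImageNet, and WMT’16 En-De.
On CIFAR-10, SlowMo with SGP, OSGP, or Local SGD raises validation accuracy by roughly 0.8 to 1.5 percentage points.
On ImageNet, SlowMo raises top-1 accuracy from 69.94% to 73.24% with Local SGD and from 74.96% to 75.54% with OSGP, at comparable time per iteration.
On WMT’16 En-De, SlowMo raises BLEU from 26.62 to 27.14 for the Local Adam baseline and from 26.92 to 27.84 for the SGP baseline.
SlowMo attains a convergence rate of O(1/√(m T τ)), which gives a linear speedup in the number of workers under the stated conditions.
A variant (SGP-SlowMo-noaverage) drops the exact averaging step yet reaches similar performance, suggesting that synchronizing the momentum buffers accounts for most of the gain.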
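
The linear-speedup claim can be read off the O(1/√(m T τ)) rate with a back-of-the-envelope calculation. This is a rough sketch rather than the paper's theorem statement: constants and higher-order terms are ignored, C is a hypothetical problem-dependent constant, and ε is the target value of whatever convergence measure the rate bounds.

```latex
\frac{C}{\sqrt{m\,T\,\tau}} \le \varepsilon
\quad\Longleftrightarrow\quad
T\tau \ge \frac{C^{2}}{m\,\varepsilon^{2}}
```

Under this reading, the number of steps each worker needs (T τ) scales as 1/m for a fixed target ε, so doubling the number of workers roughly halves the per-worker work.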


This review was generated by AI and reviewed by human editors.