QUICK REVIEW

[Paper Review] On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization

Hao Yu, Rong Jin|arXiv (Cornell University)|May 9, 2019

Stochastic Gradient Optimization Techniques | Cited by 147

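One-Sentence Summary

This paper proves that, under mild assumptions, parallel restarted SGD with momentum attains the same O(1/√(NT)) convergence rate as vanilla distributed SGD (linear speedup) while significantly reducing the number of communication rounds.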
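To make "linear speedup" concrete, the rate can be restated as an iteration count. This is a sketch with constants suppressed. The accuracy target \epsilon and the choice of the averaged squared gradient norm as the stationarity measure are assumptions added here for illustration; \bar{x}^{(t)} denotes the average of the workers' iterates at step t, N the number of workers, and T the number of iterations.

```latex
\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\,\bigl\|\nabla f(\bar{x}^{(t)})\bigr\|^{2}
  = O\!\left(\frac{1}{\sqrt{NT}}\right),
\qquad
\frac{1}{\sqrt{NT}} \le \epsilon
\;\Longleftrightarrow\;
T \ge \frac{1}{N\epsilon^{2}} .
```

Under this reading, the number of iterations each worker needs to reach a fixed accuracy shrinks in proportion to 1/N, which is what "linear" refers to.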

ABSTRACT

Recent developments on large-scale distributed machine learning applications, e.g., deep neural networks, benefit enormously from the advances in distributed non-convex optimization techniques, e.g., distributed Stochastic Gradient Descent (SGD). A series of recent works study the linear speedup property of distributed SGD variants with reduced communication. The linear speedup property enables us to scale out the computing capability by adding more computing nodes into our system. The reduced communication complexity is desirable since communication overhead is often the performance bottleneck in distributed systems. Recently, momentum methods are more and more widely adopted in training machine learning models and can often converge faster and generalize better. For example, many practitioners use distributed SGD with momentum to train deep neural networks with big data. However, it remains unclear whether any distributed momentum SGD possesses the same linear speedup property as distributed SGD and has reduced communication complexity. This paper fills the gap by considering a distributed communication efficient momentum SGD method and proving its linear speedup property.

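Motivation and Objectives

Advance the study of linear speedup for SGD with momentum in distributed non-convex optimization.
Analyze parallel restarted SGD with momentum (PR-SGD-Momentum) and establish its convergence and communication efficiency.
Show how momentum can be incorporated while reducing communication overhead and preserving the convergence rate.
Compare the two momentum variants (Polyak and Nesterov) and prove that they have similar convergence properties within the proposed framework.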

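Proposed Method

Study problem (1) under Assumption 1 (smoothness and bounded variance/heterogeneity).
Propose parallel restarted SGD with momentum (Algorithm 1), with two momentum options (Polyak and Nesterov).
Prove that, with periodic averaging across workers, the node-averaged iterate \bar{x}^{(t)} follows dynamics similar to those of momentum SGD.
Derive a convergence bound for Option I (Polyak's momentum) that shows its dependence on the learning rate γ, the momentum β, and the synchronization interval I.
Extend the analysis to Option II (Nesterov's momentum) and obtain a similar convergence rate.
Establish (i) linear speedup with γ = √N/√T and I = 1, and (ii) linear speedup with reduced communication: O(N^{3/2}T^{1/2}) communication rounds for homogeneous data and O(N^{3/4}T^{3/4}) communication rounds for heterogeneous data.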
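The steps above can be illustrated with a small NumPy simulation. This is a minimal sketch under stated assumptions, not the paper's Algorithm 1: the function name pr_sgd_momentum, its parameters, and the toy quadratic objectives are hypothetical; the sketch assumes that both the local iterates and the local momentum buffers are averaged at every synchronization, and it uses the common form of Nesterov's momentum (the one found in popular deep learning optimizers) for the second option.

```python
import numpy as np


def pr_sgd_momentum(grad_fns, x0, gamma, beta, sync_interval, num_iters,
                    nesterov=False, noise_std=0.0, seed=0):
    """Toy simulation of parallel restarted SGD with momentum.

    grad_fns      : one gradient function per worker (hypothetical interface)
    gamma, beta   : learning rate and momentum coefficient
    sync_interval : number of local steps between synchronizations (I)
    Returns the averaged iterate and the number of communication rounds.
    """
    rng = np.random.default_rng(seed)
    n = len(grad_fns)
    x = np.tile(np.asarray(x0, dtype=float), (n, 1))  # local iterates, shape (n, d)
    u = np.zeros_like(x)                              # local momentum buffers
    rounds = 0
    for t in range(1, num_iters + 1):
        for i, grad in enumerate(grad_fns):
            # Stochastic gradient: exact gradient plus optional Gaussian noise.
            g = grad(x[i]) + noise_std * rng.standard_normal(x[i].shape)
            u[i] = beta * u[i] + g
            # Polyak steps along the buffer; Nesterov adds a look-ahead term.
            step = beta * u[i] + g if nesterov else u[i]
            x[i] = x[i] - gamma * step
        if t % sync_interval == 0:
            # Assumption: iterates AND momentum buffers are both averaged.
            x[:] = x.mean(axis=0)
            u[:] = u.mean(axis=0)
            rounds += 1
    return x.mean(axis=0), rounds


# Toy heterogeneous problem: worker i minimizes 0.5 * ||x - c_i||^2,
# so the global minimizer is the mean of the centers c_i.
centers = [np.array([1.0, -2.0]), np.array([3.0, 0.0]),
           np.array([-1.0, 5.0]), np.array([5.0, 1.0])]
grad_fns = [lambda x, c=c: x - c for c in centers]

x_hat, rounds = pr_sgd_momentum(grad_fns, np.zeros(2), gamma=0.05, beta=0.9,
                                sync_interval=4, num_iters=400)
print(x_hat, rounds)
```

With sync_interval=1 the sketch communicates at every step; larger values trade communication for local drift between workers, which is the trade-off the convergence bound quantifies through I.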

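Experimental Results

Research Questions

RQ1: In the non-convex setting, can distributed SGD with momentum achieve the same linear speedup (O(1/√(NT))) as distributed SGD without momentum?
RQ2: How does the communication interval I affect convergence, and can communication be reduced without sacrificing the speedup?
RQ3: Within the proposed framework, do Polyak's and Nesterov's momentum have convergence rates of the same order?
RQ4: What is the communication-round complexity of momentum-based distributed training with homogeneous and with heterogeneous data?
RQ5: How does decentralized communication affect the linear speedup property?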

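Key Findings

Under Assumption 1, with suitable choices of γ and I, PR-SGD-Momentum achieves O(1/√(NT)) convergence, i.e., linear speedup.
With homogeneous data (κ = 0), O(N^{3/2}T^{1/2}) communication rounds over T iterations suffice to retain linear speedup.
With heterogeneous data (κ > 0), O(N^{3/4}T^{3/4}) communication rounds over T iterations suffice to achieve linear speedup.
Polyak's and Nesterov's momentum have the same convergence rate up to constant factors, and therefore share the same linear speedup property.
With decentralized communication (Algorithm 2), linear speedup also holds under Assumptions 1 and 2: given a suitable γ and the standard mixing condition (ρ), the method converges at rate O(1/√(NT)).
Experiments with ResNet-56 on CIFAR-10 confirm faster convergence and demonstrate the practical benefit of communication-skipping momentum methods.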
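The communication-round counts above can be turned into a quick back-of-the-envelope calculation. Since the number of rounds is T/I, the stated counts correspond to a synchronization interval of order T^{1/2}/N^{3/2} for homogeneous data and T^{1/4}/N^{3/4} for heterogeneous data. The sketch below sets all hidden constants to 1, which is an assumption made purely for illustration; the function names are hypothetical.

```python
import math


def sync_interval(num_workers, num_iters, homogeneous):
    """Order-of-magnitude synchronization interval I (hidden constants set to 1)."""
    n, t = num_workers, num_iters
    if homogeneous:
        raw = math.sqrt(t) / (n * math.sqrt(n))   # T^{1/2} / N^{3/2}
    else:
        raw = t ** 0.25 / n ** 0.75               # T^{1/4} / N^{3/4}
    return max(1, math.floor(raw))                # never skip below I = 1


def communication_rounds(num_workers, num_iters, homogeneous):
    """Number of synchronizations needed over num_iters iterations."""
    interval = sync_interval(num_workers, num_iters, homogeneous)
    return math.ceil(num_iters / interval)


# Example: N = 4 workers, T = 1,000,000 iterations.
print(sync_interval(4, 10**6, homogeneous=True),
      communication_rounds(4, 10**6, homogeneous=True))
print(sync_interval(4, 10**6, homogeneous=False),
      communication_rounds(4, 10**6, homogeneous=False))
```

For these values the homogeneous case allows far longer stretches of local computation than the heterogeneous case, and both communicate much less often than synchronizing at every iteration (I = 1, i.e. T rounds).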

This review was generated by AI and checked by a human editor.