QUICK REVIEW

[Paper Review] Local SGD with Periodic Averaging: Tighter Analysis and Adaptive Synchronization

Farzin Haddadpour, Mohammad Mahdi Kamani|arXiv (Cornell University)|Oct 30, 2019

Reinforcement Learning in Robotics | Cited by 92

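One-Sentence Summary

Under the Polyak-Łojasiewicz (PL) condition, the paper strengthens the convergence analysis of Local SGD with periodic model averaging, shows that O((pT)^{1/3}) communication rounds suffice for linear speedup, and introduces an adaptive synchronization scheme.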

ABSTRACT

Communication overhead is one of the key challenges that hinders the scalability of distributed optimization algorithms. In this paper, we study local distributed SGD, where data is partitioned among computation nodes, and the computation nodes perform local updates with periodically exchanging the model among the workers to perform averaging. While local SGD is empirically shown to provide promising results, a theoretical understanding of its performance remains open. We strengthen convergence analysis for local SGD, and show that local SGD can be far less expensive and applied far more generally than current theory suggests. Specifically, we show that for loss functions that satisfy the Polyak-Łojasiewicz condition, $O((pT)^{1/3})$ rounds of communication suffice to achieve a linear speed up, that is, an error of $O(1/pT)$, where $T$ is the total number of model updates at each worker. This is in contrast with previous work which required higher number of communication rounds, as well as was limited to strongly convex loss functions, for a similar asymptotic performance. We also develop an adaptive synchronization scheme that provides a general condition for linear speed up. Finally, we validate the theory with experimental results, running over AWS EC2 clouds and an internal GPU cluster.

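Research Motivation and Objectives

Motivate and analyze distributed empirical risk minimization with local SGD and periodic averaging as a way to reduce communication overhead.
Give a tighter convergence bound so that non-convex problems satisfying the PL condition also achieve linear speedup.
Introduce an adaptive synchronization scheme that decides the communication frequency, i.e. the number of local updates between averaging rounds.
Validate the theoretical results with experiments on AWS EC2 and a GPU cluster.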

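Proposed Method

Each worker updates its model locally for a fixed averaging period tau, after which a communication round averages the models across workers (LUPA-SGD(tau)).
The analysis assumes unbiased stochastic gradients with bounded variance, L-smoothness, and the Polyak-Łojasiewicz (PL) condition.
It derives a convergence bound: with tau = O(T^{2/3}/p^{1/3}), E[F(x_bar^{(T)}) - F*] = O(1/(pBT)).
It proposes ADA-LUPA-SGD, which adaptively chooses tau_i from the current objective gap F(x_bar^{(i tau_0)}) - F* so that linear speedup is preserved.
It compares against prior local SGD analyses and explains why weaker assumptions still yield tighter convergence rates.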
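
The fixed-period scheme described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: workers are simulated sequentially in one process, and the objective is assumed to be least squares (a standard example of a function satisfying the PL condition, (1/2)||∇F(x)||^2 ≥ μ(F(x) - F*)). The function name `local_sgd_periodic_averaging` and all default values (`lr`, `batch`, the data sizes) are hypothetical choices made for the example.

```python
import numpy as np


def local_sgd_periodic_averaging(A, y, p=4, tau=5, T=200, lr=0.05, batch=8, seed=0):
    """Simulate local SGD with periodic averaging on F(x) = (1/2n) * ||A x - y||^2.

    p    : number of workers (simulated one after another, not in parallel)
    tau  : number of local updates between two averaging rounds
    T    : total number of updates performed by each worker
    Returns the averaged model and the number of communication rounds.
    """
    rng = np.random.default_rng(seed)
    # Partition the sample indices among the p workers.
    shards = np.array_split(rng.permutation(len(y)), p)
    models = np.zeros((p, A.shape[1]))  # one local model per worker
    rounds = 0
    for t in range(1, T + 1):
        for j, shard in enumerate(shards):
            # Mini-batch drawn (with replacement) from the worker's own shard.
            idx = rng.choice(shard, size=batch)
            residual = A[idx] @ models[j] - y[idx]
            grad = A[idx].T @ residual / batch  # stochastic gradient of the local loss
            models[j] -= lr * grad
        if t % tau == 0:
            # Synchronization: every worker restarts from the average model.
            models[:] = models.mean(axis=0)
            rounds += 1
    return models.mean(axis=0), rounds


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.normal(size=(400, 5))
    y = A @ rng.normal(size=5)  # noiseless targets, so the optimal loss is 0
    x_bar, rounds = local_sgd_periodic_averaging(A, y, p=4, tau=5, T=200)
    print("communication rounds:", rounds)
    print("final loss:", 0.5 * np.mean((A @ x_bar - y) ** 2))
```

With `tau=1` the loop reduces to fully synchronous averaging after every step; increasing `tau` cuts the number of communication rounds to `T // tau`, which is the trade-off the analysis quantifies.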

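Experimental Results

Research Questions

RQ1: Under the non-convex PL condition, can Local SGD with periodic averaging achieve linear speedup with fewer communication rounds?
RQ2: What is the tightest bound on the number of local updates tau that still maintains linear speedup?
RQ3: Does the adaptive synchronization scheme improve practical performance while retaining the theoretical guarantees?
RQ4: How do assumptions such as the PL condition and smoothness compare with bounded-gradient/variance assumptions for obtaining faster convergence?
RQ5: Do the empirical results on cloud and GPU clusters agree with the theoretical gains?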

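Key Findings

For non-convex objectives satisfying PL, O((pT)^{1/3}) communication rounds suffice for linear speedup, with error O(1/(pT)).
With tau = O(T^{2/3}/p^{1/3}) and a fixed mini-batch size B, the method reaches an error of O(1/(pBT)).
The adaptive synchronization scheme (ADA-LUPA-SGD) preserves linear speedup under reasonable conditions and can outperform fixed periodic averaging.
Dropping the bounded-gradient assumption broadens applicability while remaining more communication-efficient than prior work.
Experiments on AWS EC2 and an internal GPU cluster validate the theoretical improvements and show practical speedups.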
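
The averaging period and the communication-round count quoted above are two views of the same statement. A short check, using only the symbols already introduced (p workers, T updates per worker, averaging period tau) and ignoring the constants hidden in the O-notation:

```latex
\[
\text{rounds} \;=\; \frac{T}{\tau}
\;=\; \frac{T}{T^{2/3}/p^{1/3}}
\;=\; T^{1/3}\,p^{1/3}
\;=\; (pT)^{1/3}.
\]
```

As an illustrative example with numbers that are not taken from the paper: for p = 8 and T = 8000, tau = 8000^{2/3} / 8^{1/3} = 400 / 2 = 200, so workers communicate T / tau = 40 times, which matches (pT)^{1/3} = 64000^{1/3} = 40.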


This review was generated by AI and checked by a human editor.