QUICK REVIEW

[Paper Review] Variance Reduced Local SGD with Lower Communication Complexity

Xianfeng Liang, Shuheng Shen | arXiv (Cornell University) | Dec 30, 2019

Advanced Image and Video Retrieval Techniques | References: 37 | Cited by: 89

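One-sentence summary

VRL-SGD adds variance reduction to Local SGD to cut the communication cost of distributed training on non-identical data, achieving lower communication complexity while retaining linear iteration speedup.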

ABSTRACT

To accelerate the training of machine learning models, distributed stochastic gradient descent (SGD) and its variants have been widely adopted, which apply multiple workers in parallel to speed up training. Among them, Local SGD has gained much attention due to its lower communication cost. Nevertheless, when the data distribution on workers is non-identical, Local SGD requires $O(T^{\frac{3}{4}} N^{\frac{3}{4}})$ communications to maintain its linear iteration speedup property, where $T$ is the total number of iterations and $N$ is the number of workers. In this paper, we propose Variance Reduced Local SGD (VRL-SGD) to further reduce the communication complexity. Benefiting from eliminating the dependency on the gradient variance among workers, we theoretically prove that VRL-SGD achieves a linear iteration speedup with a lower communication complexity $O(T^{\frac{1}{2}} N^{\frac{3}{2}})$ even if workers access non-identical datasets. We conduct experiments on three machine learning tasks, and the experimental results demonstrate that VRL-SGD performs impressively better than Local SGD when the data among workers are quite diverse.

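Research motivation and goals

Motivate accelerating distributed SGD with less communication when workers hold non-identical data distributions.
Develop a Local SGD variant that mitigates the gradient variance across workers without adding extra assumptions.
Establish theoretical convergence guarantees and linear speedup under reduced communication.
Demonstrate practical effectiveness on standard machine learning tasks with non-identical data distributions.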

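Proposed method

Introduce VRL-SGD, a Local SGD variant with a variance reduction component that aligns each local gradient with the global gradient.
Compute a gradient correction term $\Delta_i$ that approximates the gap between worker $i$'s gradient and the global gradient, accumulated across communication periods.
Update each local model with the stochastic gradient corrected by $\Delta_i$ to reduce the variance across workers.
Allow $k$ local update steps between communications to reduce the number of communication rounds.
Provide a theoretical convergence analysis showing an $O(T^{-1/2}N^{-1/2})$ convergence rate together with improved communication complexity.
Show that with non-identical data, VRL-SGD reduces the communication complexity from $O(T^{3/4}N^{3/4})$ to $O(T^{1/2}N^{3/2})$.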
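
The steps above can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation: the quadratic objectives, the function names `stochastic_grad` and `train`, and all hyperparameter values are assumptions made for this example. The $\Delta_i$ update used here (add the gap between the averaged model and the local model, divided by $k$ times the learning rate) is one rule consistent with the description above; the exact form in the paper may differ.

```python
import numpy as np


def stochastic_grad(x, a, c, rng, noise_std):
    """Gradient of f_i(x) = 0.5 * sum(a_i * (x - c_i)**2), one row per worker,
    plus optional Gaussian noise to mimic minibatch sampling."""
    return a * (x - c) + noise_std * rng.standard_normal(x.shape)


def train(a, c, lr, k, periods, variance_reduced, noise_std=0.0, seed=0):
    """Run Local SGD, optionally with the Delta_i correction (VRL-SGD style)."""
    rng = np.random.default_rng(seed)
    n_workers, dim = a.shape
    x_avg = np.zeros(dim)                 # shared model after each communication
    delta = np.zeros((n_workers, dim))    # Delta_i, zero before the first round
    for _ in range(periods):
        x = np.tile(x_avg, (n_workers, 1))  # each worker restarts from the average
        for _ in range(k):                  # k local steps, no communication
            v = stochastic_grad(x, a, c, rng, noise_std) - delta
            x = x - lr * v
        x_avg = x.mean(axis=0)              # one communication round
        if variance_reduced:
            # Assumed rule: add the per-step average of how far worker i's
            # corrected gradient deviated from the workers' mean last period.
            delta = delta + (x_avg - x) / (k * lr)
    return x_avg


# Two workers with different curvatures and minimizers (non-identical data).
a = np.array([[1.0, 3.0], [3.0, 1.0]])
c = np.array([[-1.0, 1.0], [1.0, -1.0]])
x_star = (a * c).sum(axis=0) / a.sum(axis=0)  # minimizer of the average objective

x_local = train(a, c, lr=0.05, k=20, periods=100, variance_reduced=False)
x_vrl = train(a, c, lr=0.05, k=20, periods=100, variance_reduced=True)

print("optimum   :", x_star)
print("Local SGD :", x_local, "error", np.linalg.norm(x_local - x_star))
print("VRL-SGD   :", x_vrl, "error", np.linalg.norm(x_vrl - x_star))
```

In this noise-free toy setting, plain Local SGD with a long communication period settles at a point biased away from the optimum, because each worker drifts toward its own minimizer between rounds. The corrected variant removes that drift and reaches the optimum of the averaged objective. Setting `noise_std` above zero adds sampling noise to both runs.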

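Experimental results

Research questions

RQ1: Can variance reduction remove Local SGD's dependence on the gradient variance among workers when data are non-identical?
RQ2: Compared with Local SGD and S-SGD, what communication complexity and iteration speedup does VRL-SGD achieve?
RQ3: Do the guarantees for VRL-SGD extend to non-convex objectives and to the identical-data case?
RQ4: How does VRL-SGD perform empirically against the baselines on tasks with non-identical data (image, text, transfer learning)?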

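Key findings

VRL-SGD achieves linear iteration speedup on non-identical data with communication complexity $O(T^{1/2}N^{3/2})$.
It does not require the bounded gradient variance or identical-data assumptions used in earlier Local SGD analyses.
Empirical results on MNIST, DBPedia, and tiny ImageNet show that VRL-SGD outperforms Local SGD when data are non-identical, and is comparable to S-SGD and Local SGD when data are identical.
The theory shows that for non-convex objectives, with a suitable learning rate and communication period, the convergence rate is $O(1/\sqrt{NT})$.
A warm-start variant (VRL-SGD-W) reduces the dependence on the non-IID initialization (the C term) and yields a tighter convergence bound.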
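
To make the two communication complexities concrete, the short calculation below compares their growth. It is only an order-of-magnitude sketch: the constants hidden by the $O(\cdot)$ notation are ignored, the function names are hypothetical, and the values of $T$ and $N$ are arbitrary. Dividing the two expressions gives $T^{1/4}/N^{3/4}$, so the VRL-SGD bound is the smaller one once $T > N^3$.

```python
import math


def local_sgd_rounds(T, N):
    """Order of communication rounds for Local SGD on non-identical data."""
    return T ** 0.75 * N ** 0.75


def vrl_sgd_rounds(T, N):
    """Order of communication rounds for VRL-SGD."""
    return T ** 0.5 * N ** 1.5


N = 8
for T in (N ** 3, 10 ** 6):
    local = local_sgd_rounds(T, N)
    vrl = vrl_sgd_rounds(T, N)
    # T / rounds is the implied number of local steps between communications.
    print(f"T={T}: Local SGD ~{local:.0f}, VRL-SGD ~{vrl:.0f}, "
          f"ratio {local / vrl:.2f}, implied VRL-SGD period ~{T / vrl:.1f}")
```

The two bounds coincide at $T = N^3$, and the gap widens as $T$ grows, which matches the regime of long training runs on a moderate number of workers.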

This review was generated by AI and checked by human editors.