QUICK REVIEW

[Paper Review] Asynchronous Decentralized Parallel Stochastic Gradient Descent

Xiangru Lian, Wei Zhang|arXiv (Cornell University)|Oct 18, 2017

Stochastic Gradient Optimization Techniques | References: 52 | Cited by: 68

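One-sentence summary

AD-PSGD is a wait-free, asynchronous, decentralized SGD algorithm that converges at the optimal O(1/√K) rate, achieves linear speedup as the number of workers grows, and outperforms both decentralized and centralized baselines in heterogeneous environments.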

ABSTRACT

Most commonly used distributed machine learning systems are either synchronous or centralized asynchronous. Synchronous algorithms like AllReduce-SGD perform poorly in a heterogeneous environment, while asynchronous algorithms using a parameter server suffer from 1) communication bottleneck at parameter servers when workers are many, and 2) significantly worse convergence when the traffic to parameter server is congested. Can we design an algorithm that is robust in a heterogeneous environment, while being communication efficient and maintaining the best-possible convergence rate? In this paper, we propose an asynchronous decentralized stochastic gradient descent algorithm (AD-PSGD) satisfying all above expectations. Our theoretical analysis shows AD-PSGD converges at the optimal $O(1/\sqrt{K})$ rate as SGD and has linear speedup w.r.t. number of workers. Empirically, AD-PSGD outperforms the best of decentralized parallel SGD (D-PSGD), asynchronous parallel SGD (A-PSGD), and standard data parallel SGD (AllReduce-SGD), often by orders of magnitude in a heterogeneous environment. When training ResNet-50 on ImageNet with up to 128 GPUs, AD-PSGD converges (w.r.t. epochs) similarly to the AllReduce-SGD, but each epoch can be up to 4-8X faster than its synchronous counterparts in a network-sharing HPC environment. To the best of our knowledge, AD-PSGD is the first asynchronous algorithm that achieves a similar epoch-wise convergence rate as AllReduce-SGD, at an over 100-GPU scale.

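Research motivation and goals

Enable robust, scalable distributed training in heterogeneous environments without a central bottleneck.
Design an asynchronous decentralized SGD that avoids both worker idle time and a central-server bottleneck.
Prove convergence at the optimal rate and establish linear speedup as the number of workers grows.
Evaluate empirically on large-scale datasets such as ImageNet and demonstrate practical speedups over baselines.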

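Proposed method

Each worker maintains a local model and updates it with a stochastic gradient computed on a mini-batch.
Workers perform local updates asynchronously and average their local model with a randomly chosen neighbor's model; the averaging is described by a doubly stochastic matrix W_k.
The global update can be written as X_{k+1} = X_k W_k - γ ∂g(Ẋ_k; ξ_k^{i_k}, i_k), where Ẋ_k = X_{k-τ_k} is the stale model the gradient was computed on and the delay τ_k is bounded.
A deadlock-free, wait-free implementation uses a bipartite graph to schedule neighbor averaging and avoid global synchronization.
Topology options include ring-based and multi-hop (logarithmic) connections, which speed up information propagation and improve robustness.
The theoretical analysis assumes Lipschitz gradients, bounded variance, a spectral gap ρ, and a staleness bound T, and from these derives the O(1/√K) convergence rate and linear speedup.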
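
The update rule above can be illustrated with a small serial simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the toy quadratic objective, the function name `ad_psgd_simulation`, the ring topology with pairwise averaging, and all default parameter values are hypothetical choices made for illustration. Each step picks a worker i_k, computes a noisy gradient on a model that is up to `max_delay` steps old, averages that worker with one random ring neighbor (a symmetric doubly stochastic W_k), and then applies the gradient step.

```python
import random


def ad_psgd_simulation(n_workers=8, steps=4000, lr=0.05, max_delay=3, seed=0):
    """Serially simulate X_{k+1} = X_k W_k - lr * grad(stale X_k) on a ring.

    Toy problem (an assumption for illustration): worker i minimizes
    0.5 * (x - target_i)^2, so the global optimum is the mean of the targets.
    """
    rng = random.Random(seed)
    targets = [float(i) for i in range(n_workers)]
    optimum = sum(targets) / n_workers

    x = [0.0] * n_workers      # one scalar local model per worker
    history = [list(x)]        # recent iterates, used to model staleness

    for _ in range(steps):
        i = rng.randrange(n_workers)  # worker i_k that finished a gradient

        # Bounded staleness: the gradient was computed on a model that is
        # tau steps old, with tau <= max_delay.
        tau = rng.randint(0, min(max_delay, len(history) - 1))
        stale_x = history[-1 - tau][i]
        grad = (stale_x - targets[i]) + rng.gauss(0.0, 0.1)  # noisy gradient

        # Averaging step W_k: worker i averages with one random ring neighbor.
        j = (i + rng.choice([-1, 1])) % n_workers
        avg = 0.5 * (x[i] + x[j])
        x[i] = avg
        x[j] = avg

        # Gradient step, applied to worker i only.
        x[i] -= lr * grad

        history.append(list(x))
        if len(history) > max_delay + 1:
            history.pop(0)  # keep only the iterates a stale read can reach

    return x, optimum


if __name__ == "__main__":
    models, optimum = ad_psgd_simulation()
    print("mean of local models:", sum(models) / len(models))
    print("optimum:", optimum)
```

In this sketch the averaging step never changes the sum of the local models, so only the gradient steps move their mean, which drifts toward the optimum. With a constant step size the individual models stay close to, but not exactly at, consensus.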

Experimental results

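Research questions

RQ1: Can asynchronous decentralized training converge without a central parameter server while maintaining a competitive rate?
RQ2: Does AD-PSGD achieve linear speedup as the number of workers grows in a heterogeneous environment?
RQ3: How robust is the algorithm to heterogeneity in the computation and communication speeds of workers and links?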

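Key findings

The algorithm converges at the optimal O(1/√K) rate, matching SGD and D-PSGD.
AD-PSGD achieves linear speedup with respect to the number of workers.
In heterogeneous environments, AD-PSGD empirically outperforms AllReduce-SGD, D-PSGD, and A-PSGD, often by orders of magnitude.
On ImageNet with up to 128 GPUs, AD-PSGD converges similarly to AllReduce-SGD in terms of epochs, but each epoch can be up to 4-8× faster in a network-sharing HPC environment.
In a homogeneous cluster with a shared network, AD-PSGD is still 4× to 8× faster than the synchronous baselines in per-epoch run time.
AD-PSGD is highly robust to slow workers and slow network links because it confines the impact of a slow node to its local neighborhood.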
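
The link between the O(1/√K) rate and linear speedup can be spelled out in a short worked argument. It is a sketch up to constants, not the paper's theorem: it assumes that K counts the total number of gradient updates across all n workers, and the constant C and target accuracy ε are symbols introduced here for illustration.

```latex
\[
\text{error after } K \text{ updates} \;\le\; \frac{C}{\sqrt{K}}
\quad\Longrightarrow\quad
K_\epsilon \approx \frac{C^2}{\epsilon^2},
\qquad
\text{updates per worker} \approx \frac{K_\epsilon}{n} = \frac{C^2}{n\,\epsilon^2}.
\]
```

Because the workers compute concurrently, wall-clock time scales with the number of updates per worker. Under these assumptions, doubling n roughly halves the time needed to reach accuracy ε, provided K is large enough for the 1/√K term to dominate.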


This review was generated by AI and reviewed by human editors.