QUICK REVIEW

[Paper Review] Communication Compression for Decentralized Training

Hanlin Tang, Shaoduo Gan|arXiv (Cornell University)|Mar 17, 2018

Stochastic Gradient Optimization Techniques | Cited by 185

One-Sentence Summary

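The paper proposes two quantized decentralized SGD algorithms (DCD-PSGD and ECD-PSGD) that compress the exchanged models while still guaranteeing convergence at a rate of O(1/√(nT)); experiments with ResNet-20 on CIFAR-10 show significant speedups under high-latency, low-bandwidth conditions.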

ABSTRACT

Optimizing distributed learning systems is an art of balancing between computation and communication. There have been two lines of research that try to deal with slower networks: communication compression for low bandwidth networks, and decentralization for high latency networks. In this paper, we explore a natural question: can the combination of both techniques lead to a system that is robust to both bandwidth and latency? Although the system implication of such combination is trivial, the underlying theoretical principle and algorithm design is challenging: unlike centralized algorithms, simply compressing exchanged information, even in an unbiased stochastic way, within the decentralized network would accumulate the error and fail to converge. In this paper, we develop a framework of compressed, decentralized training and propose two different strategies, which we call extrapolation compression and difference compression. We analyze both algorithms and prove both converge at the rate of O(1/√(nT)) where n is the number of workers and T is the number of iterations, matching the convergence rate for full precision, centralized training. We validate our algorithms and find that our proposed algorithm outperforms the best of merely decentralized and merely quantized algorithm significantly for networks with both high latency and low bandwidth.
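
To make "unbiased stochastic compression" concrete, here is a minimal Python sketch of one such operator: randomized rounding onto a uniform grid. The function name `stochastic_quantize`, the min/max grid, and the default of 16 levels are illustrative assumptions and not necessarily the quantizer used in the paper; the only property taken from the abstract is unbiasedness, i.e. the expected output equals the input.

```python
import numpy as np

def stochastic_quantize(z, num_levels=16, rng=None):
    """Unbiased randomized rounding of z onto a uniform grid (illustrative only)."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(z, dtype=np.float64)
    lo, hi = z.min(), z.max()
    if hi == lo:                       # constant vector: nothing to quantize
        return z.copy()
    step = (hi - lo) / (num_levels - 1)
    scaled = (z - lo) / step           # position on the grid, in [0, num_levels - 1]
    floor = np.floor(scaled)
    prob_up = scaled - floor           # probability of rounding up to the next level
    rounded = floor + (rng.random(z.shape) < prob_up)
    # E[rounded] = floor + prob_up = scaled, so E[output] = z.
    return lo + rounded * step
```

Averaging many independent draws of `stochastic_quantize(z)` recovers `z`, but any single draw carries rounding noise of up to one grid step per coordinate. That per-message noise is what the abstract warns can accumulate when models are compressed naively in a decentralized network.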

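Motivation and Goals

Enable robust distributed training by combining decentralization with communication compression to cope with high-latency, low-bandwidth networks.
Develop two compressed decentralized SGD algorithms (DCD-PSGD and ECD-PSGD) with convergence guarantees.
Provide a theoretical convergence analysis that, under certain conditions, matches the rate of centralized training.
Verify empirically that the proposed methods outperform purely decentralized or purely quantized methods on challenging networks.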

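Proposed Method

Formalize decentralized optimization over n nodes with an objective function whose gradients are Lipschitz.
Introduce two quantized decentralized SGD algorithms: DCD-PSGD (difference compression) and ECD-PSGD (extrapolation compression).
Assume a symmetric doubly stochastic communication matrix W with a spectral gap governed by ρ, Lipschitz gradients, bounded gradient variances σ² (within a worker) and ζ² (across workers), and unbiased stochastic compression with a signal-to-noise-ratio parameter α.
For DCD-PSGD, compress the difference z_t^(i) = x_{t+1/2}^(i) − x_t^(i) and update the neighbors' replicas of the model accordingly; Theorem 1 and its corollary guarantee convergence.
For ECD-PSGD, transmit a compressed extrapolated z value from which neighbors maintain an estimate of the sender's model; convergence is proved under Assumption 2 (bounded compression noise), at a rate comparable to DCD-PSGD and with greater robustness to aggressive compression.
Derive the convergence rate: the leading term is O(σ/√(nT)), with additional terms depending on ζ, α, ρ, and γ; a corollary gives an overall rate of O(1/√(nT)), i.e. linear speedup in the number of nodes.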
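
The difference-compression step can be sketched as a small single-process simulation. Everything concrete here is an assumption made for illustration: the ring topology with weights 1/3, the quadratic local objectives f_i(x) = ½‖x − c_i‖², the helper names `ring_matrix`, `compress_rows`, and `dcd_psgd`, the 256-level quantizer, and the hyperparameters. Because every neighbor applies the same compressed difference to its replica, replicas stay identical to the sender's model, so one matrix `X` (one row per worker) stands in for all copies.

```python
import numpy as np

def ring_matrix(n):
    """Symmetric doubly stochastic W for a ring (n >= 3): self and two neighbors get 1/3."""
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] = 1.0 / 3.0
    return W

def compress_rows(Z, num_levels, rng):
    """Unbiased randomized rounding, applied to each worker's row separately."""
    lo = Z.min(axis=1, keepdims=True)
    hi = Z.max(axis=1, keepdims=True)
    step = np.where(hi > lo, (hi - lo) / (num_levels - 1), 1.0)
    scaled = (Z - lo) / step
    floor = np.floor(scaled)
    rounded = floor + (rng.random(Z.shape) < scaled - floor)
    return lo + rounded * step

def dcd_psgd(C, steps=500, lr=0.05, noise=0.1, num_levels=256, seed=0):
    """Simulate difference-compressed decentralized SGD on f_i(x) = 0.5 * ||x - c_i||^2."""
    rng = np.random.default_rng(seed)
    n, d = C.shape
    W = ring_matrix(n)
    X = np.zeros((n, d))                       # row i is worker i's model x_t^(i)
    for _ in range(steps):
        G = (X - C) + noise * rng.standard_normal((n, d))  # noisy local gradients
        X_half = W @ X - lr * G                # x_{t+1/2}: neighbor averaging + SGD step
        Z = X_half - X                         # difference z_t = x_{t+1/2} - x_t
        X = X + compress_rows(Z, num_levels, rng)  # everyone applies the compressed difference
    return X
```

With targets `C` drawn at random, the average of the rows of `dcd_psgd(C)` should end up close to `C.mean(axis=0)`, the minimizer of the averaged objective. With a constant step size the individual rows do not reach exact consensus, and lowering `num_levels` far enough is a simple way to observe the sensitivity of difference compression to aggressive quantization.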

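Experimental Results

Research Questions

RQ1: Can decentralized training combined with unbiased compression converge without accumulating error?
RQ2: What convergence rates can be established for compressed decentralized SGD, and how do they compare with centralized and uncompressed decentralized baselines?
RQ3: How do the two proposed strategies (difference compression and extrapolation compression) differ in robustness and performance under different network conditions?
RQ4: In practical settings, do the proposed methods achieve linear speedup in the number of workers?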

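Key Findings

Both compressed decentralized SGD algorithms (DCD-PSGD and ECD-PSGD) converge at a rate of roughly O(1/√(nT)).
ECD-PSGD is more robust to aggressive compression, while DCD-PSGD may achieve a better rate when data variation across nodes is large; however, overly aggressive compression can cause DCD-PSGD to diverge.
The leading convergence term matches that of centralized parallel SGD, indicating linear speedup in the number of nodes.
Experiments complement the theory, showing that decentralized, low-precision training can outperform Allreduce on high-latency or low-bandwidth networks.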
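
Why an O(1/√(nT)) rate amounts to linear speedup can be seen from a short worked inequality. This is a sketch under a simplifying assumption: for large T the bound is dominated by its leading term, written here as c/√(nT) with a hypothetical constant c that absorbs σ and the other problem constants.

```latex
% Assumption: the error bound after T iterations on n workers is c / sqrt(nT).
\frac{c}{\sqrt{nT}} \le \epsilon
\quad\Longleftrightarrow\quad
T \ge \frac{c^{2}}{n\,\epsilon^{2}}
```

Under that assumption, the number of iterations needed to reach a target accuracy ε scales as 1/n, so doubling the number of workers halves the iterations required.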

This review was generated by AI and checked by human editors.