QUICK REVIEW

[Paper Review] On Biased Compression for Distributed Learning

Aleksandr Beznosikov, Samuel Horváth|arXiv (Cornell University)|Feb 27, 2020

Stochastic Gradient Optimization Techniques | References: 32 | Cited by: 48

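One-Sentence Summary

This paper analyzes biased gradient compression operators in distributed learning, proves linear convergence with error feedback, compares biased and unbiased compressors in the single-node and multi-node settings, and studies three classes of biased compressors (two of them new) along with new operators.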

ABSTRACT

In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact that biased compressors often show superior performance in practice when compared to the much more studied and understood unbiased compressors, very little is known about them. In this work we study three classes of biased compression operators, two of which are new, and their performance when applied to (stochastic) gradient descent and distributed (stochastic) gradient descent. We show for the first time that biased compressors can lead to linear convergence rates both in the single node and distributed settings. We prove that distributed compressed SGD method, employed with error feedback mechanism, enjoys the ergodic rate $O\left( \delta L \exp \left[-\frac{\mu K}{\delta L}\right] + \frac{(C + \delta D)}{K\mu} \right)$, where $\delta \ge 1$ is a compression parameter which grows when more compression is applied, $L$ and $\mu$ are the smoothness and strong convexity constants, $C$ captures stochastic gradient noise ($C=0$ if full gradients are computed on each node) and $D$ captures the variance of the gradients at the optimum ($D=0$ for over-parameterized models). Further, via a theoretical study of several synthetic and empirical distributions of communicated gradients, we shed light on why and by how much biased compressors outperform their unbiased variants. Finally, we propose several new biased compressors with promising theoretical guarantees and practical performance.
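
To see what the ergodic rate in the abstract implies for iteration counts, here is a short worked bound. It is only an illustration: it assumes the noiseless, over-parameterized case $C = D = 0$, treats the constant hidden in $O(\cdot)$ as 1, and takes a target accuracy $\varepsilon < \delta L$ (the symbol $\varepsilon$ is introduced here and does not appear in the abstract).

$$
\delta L \exp\left[-\frac{\mu K}{\delta L}\right] \le \varepsilon
\quad\Longleftrightarrow\quad
K \ge \frac{\delta L}{\mu} \log\frac{\delta L}{\varepsilon}
$$

Under these assumptions the number of iterations scales with $\delta$ times the condition number $L/\mu$, up to a logarithmic factor, so stronger compression (larger $\delta$) is paid for with proportionally more communication rounds.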

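Research Motivation and Objectives

Motivate and formalize biased compression as a tool for reducing communication in distributed learning.
Introduce three parameterized classes of biased compressors and relate them to unbiased compressors.
Establish convergence guarantees for biased gradient methods in the single-node setting and in the distributed setting with error feedback.
Explore when biased compressors outperform their unbiased counterparts under different distributions of the communicated gradients.
Propose new biased compressors with theoretical guarantees and good practical performance.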

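Proposed Method

Define three classes of biased compressors, $B^1(\alpha,\beta)$, $B^2(\gamma,\beta)$, and $B^3(\delta)$, and relate them to the unbiased class $U(\zeta)$.
Prove equivalence and scaling relations among these compressor classes (Theorem 6).
Derive convergence rates for compressed gradient descent (CGD) with biased compression in the single-node setting for each class (Theorems 17–19, Table 1).
Analyze the effect of scaling on convergence rates and compare biased and unbiased performance under distributional assumptions.
Extend the analysis to distributed SGD with error feedback and give ergodic convergence rates under several schedules (Theorem 21, Table 2).
Survey a wide range of biased and unbiased compressors and classify them into the three classes (Table 3).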
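
To make the distinction between a biased and an unbiased compressor concrete, the following is a minimal NumPy sketch. It is not code from the paper. It assumes the usual contraction form of the third class, $E\|C(x) - x\|^2 \le (1 - 1/\delta)\|x\|^2$, and uses two standard sparsifiers as examples: Top-k (biased, with $\delta = d/k$) and Rand-k rescaled by $d/k$ (unbiased). The helper names `top_k`, `rand_k`, and `contraction_ratio` are hypothetical and chosen for illustration.

```python
import numpy as np

def top_k(x, k):
    """Biased Top-k: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest |x_j|
    out[idx] = x[idx]
    return out

def rand_k(x, k, rng):
    """Unbiased Rand-k: keep k uniformly random entries, rescaled by d/k."""
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = x[idx] * (d / k)
    return out

def contraction_ratio(x, cx):
    """||C(x) - x||^2 / ||x||^2; the assumed B^3(delta) bound is 1 - 1/delta."""
    return np.sum((cx - x) ** 2) / np.sum(x ** 2)

rng = np.random.default_rng(0)
d, k = 1000, 100
x = rng.standard_normal(d)  # synthetic "gradient" (an assumption)

# Top-k is deterministic, so one evaluation is enough.
topk_ratio = contraction_ratio(x, top_k(x, k))

# Rand-k is random: estimate its mean and its expected squared error.
samples = np.stack([rand_k(x, k, rng) for _ in range(2000)])
randk_mean_err = np.linalg.norm(samples.mean(axis=0) - x) / np.linalg.norm(x)
randk_ratio = np.mean([contraction_ratio(x, s) for s in samples])

print(f"Top-k  ratio: {topk_ratio:.3f} (bound 1 - k/d = {1 - k / d:.3f})")
print(f"Rand-k ratio: {randk_ratio:.3f} (theory d/k - 1 = {d / k - 1:.3f})")
print(f"Rand-k relative error of the sample mean: {randk_mean_err:.3f}")
```

In this sketch Top-k always lands within the $1 - k/d$ bound, while the rescaled Rand-k is unbiased on average but has a much larger squared error at the same sparsity, which is the kind of trade-off the comparison between classes is about.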

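Experimental Results

Research Questions

RQ1: Can biased compression operators achieve linear convergence for SGD/gradient methods in the single-node and distributed settings?
RQ2: How do biased compressors compare with unbiased ones under different statistical distributions of the gradient components?
RQ3: Under standard smoothness and strong-convexity assumptions, what are the exact convergence rates and complexities of biased compressors?
RQ4: How does error feedback enable biased compressors to converge stably in distributed learning?
RQ5: Which new biased compressors with provable guarantees and practical effectiveness can be designed?

Key Findings

Biased compressors can achieve linear convergence in the single-node setting and, when combined with error feedback, in the distributed setting.
Three classes of biased compressors are defined, with theorems giving explicit convergence rates; Table 1 summarizes the complexity of CGD for each class.
Distributed SGD with error feedback converges ergodically, at a rate that depends on $\delta$, $\mu$, $L$, and $K$ (Table 2).
Equivalence results show how the biased classes relate to unbiased compressors and can emulate them, which guides parameter selection and scaling (Theorem 6).
Several new biased compressors are proposed and classified (Table 3), offering practical alternatives with theoretical guarantees.
The analysis indicates that, for certain gradient distributions, biased compressors may have an empirical advantage over their unbiased variants.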
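
The error-feedback mechanism behind the distributed result can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm or experimental setup: each node holds a least-squares objective with a shared minimizer (so gradients vanish at the optimum, mirroring $D = 0$), full gradients are used ($C = 0$), the compressor is Top-k, and the step size, problem sizes, and the function name `ef_distributed_gd` are arbitrary choices made for the example.

```python
import numpy as np

def top_k(x, k):
    """Biased Top-k: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def ef_distributed_gd(A_list, b_list, k, lr, steps):
    """Distributed gradient descent with Top-k compression and error feedback.

    Node i holds f_i(x) = ||A_i x - b_i||^2 / (2 m_i). It sends only the
    compressed vector C(e_i + lr * grad f_i(x)) and stores what was
    dropped in its local error buffer e_i for the next round.
    """
    n, d = len(A_list), A_list[0].shape[1]
    x = np.zeros(d)
    errors = [np.zeros(d) for _ in range(n)]
    for _ in range(steps):
        update = np.zeros(d)
        for i, (A, b) in enumerate(zip(A_list, b_list)):
            grad = A.T @ (A @ x - b) / A.shape[0]  # full local gradient
            p = errors[i] + lr * grad              # add back past error
            msg = top_k(p, k)                      # what is communicated
            errors[i] = p - msg                    # remember the residual
            update += msg / n                      # server averages messages
        x = x - update
    return x

rng = np.random.default_rng(0)
n, m, d, k = 4, 50, 20, 5  # nodes, rows per node, dimension, kept entries
x_star = rng.standard_normal(d)
A_list = [rng.standard_normal((m, d)) for _ in range(n)]
b_list = [A @ x_star for A in A_list]  # shared minimizer x_star (assumption)

x_hat = ef_distributed_gd(A_list, b_list, k=k, lr=0.005, steps=10000)
rel_err = np.linalg.norm(x_hat - x_star) / np.linalg.norm(x_star)
print(f"relative error: {rel_err:.2e}")
```

The design point is that the residual `p - msg` is never discarded: coordinates that Top-k drops in one round accumulate in the error buffer until they are large enough to be sent, which is what lets a biased compressor still reach the minimizer in this sketch.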


This review was generated by AI and checked by a human editor.