QUICK REVIEW

[Paper Review] Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback

Shuai Zheng, Ziyue Huang|arXiv (Cornell University)|May 27, 2019

Advanced Neural Network Applications | References: 28 | Cited by: 41

One-Sentence Summary

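This paper proposes dist-EF-SGD and its blockwise variant, which combine two-way gradient compression, error feedback, and momentum to cut communication by nearly 32× while preserving the convergence rate on nonconvex problems and matching full-precision distributed SGD/SGDM in practice.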

ABSTRACT

Communication overhead is a major bottleneck hampering the scalability of distributed machine learning systems. Recently, there has been a surge of interest in using gradient compression to improve the communication efficiency of distributed neural network training. Using 1-bit quantization, signSGD with majority vote achieves a 32x reduction on communication cost. However, its convergence is based on unrealistic assumptions and can diverge in practice. In this paper, we propose a general distributed compressed SGD with Nesterov's momentum. We consider two-way compression, which compresses the gradients both to and from workers. Convergence analysis on nonconvex problems for general gradient compressors is provided. By partitioning the gradient into blocks, a blockwise compressor is introduced such that each gradient block is compressed and transmitted in 1-bit format with a scaling factor, leading to a nearly 32x reduction on communication. Experimental results show that the proposed method converges as fast as full-precision distributed momentum SGD and achieves the same testing accuracy. In particular, on distributed ResNet training with 7 workers on the ImageNet, the proposed algorithm achieves the same testing accuracy as momentum SGD using full-precision gradients, but with $46\%$ less wall clock time.
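
The "nearly 32x" figure follows from simple bit counting. The sketch below is an illustration rather than the paper's own accounting: it assumes 32-bit floats for both the uncompressed gradient entries and the per-block scaling factors, and the function name `compression_ratio` is hypothetical.

```python
def compression_ratio(block_sizes, float_bits=32):
    """Ratio of full-precision bits to compressed bits for one gradient.

    Assumption: each gradient entry costs `float_bits` uncompressed and
    1 sign bit compressed, plus one `float_bits` scaling factor per block.
    """
    total_entries = sum(block_sizes)
    full_bits = float_bits * total_entries
    compressed_bits = total_entries + float_bits * len(block_sizes)
    return full_bits / compressed_bits


# One large block: the scaling factor is negligible, so the ratio approaches 32.
ratio_large_block = compression_ratio([1_000_000])

# Ten tiny blocks of 32 entries: scaling factors cost as much as the sign bits.
ratio_tiny_blocks = compression_ratio([32] * 10)

print(ratio_large_block, ratio_tiny_blocks)
```

The ratio stays close to 32 as long as blocks are large compared with the 32 bits spent on each scaling factor, which is why the reduction is "nearly" rather than exactly 32x.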

Motivation and Objectives

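Motivate and address the communication bottleneck of distributed SGD/SGDM in large-scale deep learning.
Develop two-way gradient compression with error feedback for the parameter-server architecture.
Propose blockwise gradient compression to improve compression quality while retaining convergence.
Establish theoretical convergence guarantees for dist-EF-SGD and dist-EF-SGDM on nonconvex objectives.
Validate the method experimentally on ResNet/ImageNet and CIFAR-100, matching the accuracy of full-precision training.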

Proposed Method

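Extend EF-SGD to the distributed setting, applying gradient compression and error feedback on both the workers and the server.
Introduce dist-EF-SGD and its blockwise variant dist-EF-blockSGD, which use two-way compression with error correction, including rescaling of the local and global error terms when the step size changes.
Provide a convergence analysis under standard assumptions, proving an O(1/√(MT)) rate on nonconvex problems that matches full-precision distributed SGD.
Introduce the blockwise compressor C_B, which partitions the gradient into blocks and compresses each block with its own scaling factor; this maintains a higher δ (the compressor's approximation quality) while achieving a communication reduction of about 32×.
Extend the same two-way compression framework to Nesterov momentum (dist-EF-blockSGDM), derive the corresponding convergence results, and discuss the trade-off between momentum and compression noise.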
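
The two-way scheme described above can be sketched as a single-process simulation. This is a minimal illustration, not the authors' implementation: it assumes a scaled-sign compressor over the whole vector, a constant step size (so the error-rescaling factor is 1), no momentum, and a toy noisy quadratic objective. All function names are hypothetical.

```python
import math
import random


def scaled_sign(v):
    """Scaled sign compressor: every entry becomes +/- (||v||_1 / d)."""
    scale = sum(abs(x) for x in v) / len(v)
    return [scale if x >= 0 else -scale for x in v]


def dist_ef_sgd(grad_fn, x0, num_workers, steps, lr, seed=0):
    """Simulate two-way compressed SGD with error feedback on one machine."""
    rng = random.Random(seed)
    x = list(x0)
    worker_err = [[0.0] * len(x) for _ in range(num_workers)]  # local errors
    server_err = [0.0] * len(x)                                # global error
    for _ in range(steps):
        pushed = []
        for i in range(num_workers):
            g = grad_fn(x, rng)
            # Worker: add back the residual left over from earlier compression.
            p = [gj + ej for gj, ej in zip(g, worker_err[i])]
            delta = scaled_sign(p)
            worker_err[i] = [pj - dj for pj, dj in zip(p, delta)]
            pushed.append(delta)
        # Server: average the compressed pushes, correct, and compress again.
        avg = [sum(col) / num_workers for col in zip(*pushed)]
        p_srv = [aj + ej for aj, ej in zip(avg, server_err)]
        delta_srv = scaled_sign(p_srv)
        server_err = [pj - dj for pj, dj in zip(p_srv, delta_srv)]
        # Every worker applies the same compressed update.
        x = [xj - lr * dj for xj, dj in zip(x, delta_srv)]
    return x


def noisy_quadratic_grad(x, rng):
    """Stochastic gradient of 0.5 * ||x||^2 with Gaussian noise (toy objective)."""
    return [xj + rng.gauss(0.0, 0.1) for xj in x]


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


x_start = [1.0, -2.0, 3.0, 0.5]
x_final = dist_ef_sgd(noisy_quadratic_grad, x_start, num_workers=4, steps=1500, lr=0.02)
print(l2_norm(x_start), l2_norm(x_final))
```

Each residual (`worker_err`, `server_err`) stores what compression discarded and is re-injected at the next step, so no gradient information is permanently lost even though only signs and one scale are transmitted in each direction.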

Experimental Results

Research Questions

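RQ1: In a distributed parameter-server setting, can two-way gradient compression with error feedback obtain convergence guarantees for nonconvex objectives?
RQ2: How does blockwise compression affect compression quality and convergence compared with sign compression applied once to the whole gradient?
RQ3: In nonconvex learning, what convergence rates do dist-EF-SGD and dist-EF-SGDM achieve under constant, decreasing, and increasing step sizes?
RQ4: Compared with standard 1-bit sign compression, how does the proposed blockwise compressor raise δ and thereby improve convergence?
RQ5: In large-scale experiments (for example, ResNet on ImageNet), does the proposed method preserve accuracy while substantially reducing communication?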

Key Findings

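dist-EF-SGD achieves an O(1/√(MT)) convergence rate under standard assumptions, matching distributed SGD with full-precision gradients.
dist-EF-SGDM, which adds Nesterov momentum, also achieves an O(1/√(MT)) convergence rate.
The blockwise compressor C_B is a φ(v)-approximate compressor with φ(v) ≥ min_b 1/d_b, where d_b is the size of block b, and it delivers a communication reduction of nearly 32×.
Empirically, test accuracy matches full-precision momentum SGD, with a substantial wall-clock saving on ImageNet/ResNet-50 (about 46% less wall-clock time).
SignSGD-based methods show worse accuracy in the reported experiments, highlighting the robustness of the EF-based and blockwise EF methods.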
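
The effect of blockwise scaling on δ can be checked numerically. The sketch below is illustrative only: it assumes the compressor is a per-block scaled sign, measures δ as the largest value satisfying ||C(v) − v||² ≤ (1 − δ)||v||², and uses hypothetical function names and a hand-picked vector whose two blocks differ greatly in magnitude.

```python
def block_scaled_sign(v, block_sizes):
    """Compress each block to +/- (||v_b||_1 / d_b), one scale per block."""
    out, start = [], 0
    for d_b in block_sizes:
        block = v[start:start + d_b]
        scale = sum(abs(x) for x in block) / d_b
        out.extend(scale if x >= 0 else -scale for x in block)
        start += d_b
    return out


def measured_delta(v, compressed):
    """Largest delta with ||C(v) - v||^2 <= (1 - delta) * ||v||^2."""
    err = sum((c - x) ** 2 for c, x in zip(compressed, v))
    norm_sq = sum(x * x for x in v)
    return 1.0 - err / norm_sq


# Two blocks with very different magnitudes (hand-picked for illustration).
v = [10.0, -10.0, 0.1, -0.1]

delta_whole = measured_delta(v, block_scaled_sign(v, [4]))     # one global scale
delta_block = measured_delta(v, block_scaled_sign(v, [2, 2]))  # one scale per block

print(delta_whole, delta_block)
```

With a single global scale, the small entries are badly over-estimated and δ drops to roughly one half; with one scale per block, each block here is represented almost exactly. Both values respect the lower bound min_b 1/d_b stated above.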


This review was generated by AI and checked by human editors.