QUICK REVIEW

[Paper Review] An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems

Ahmed M. Abdelmoniem, Ahmed Elzanaty|arXiv (Cornell University)|Jan 26, 2021

Advanced Neural Network Applications | References: 71 | Cited by: 31

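One-Sentence Summary

SIDCo introduces a multi-stage, threshold-based gradient sparsification method that uses sparsity-inducing distributions to estimate the compression threshold accurately and at low overhead, which speeds up distributed training.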

ABSTRACT

The recent many-fold increase in the size of deep neural networks makes efficient distributed training challenging. Many proposals exploit the compressibility of the gradients and propose lossy compression techniques to speed up the communication stage of distributed training. Nevertheless, compression comes at the cost of reduced model quality and extra computation overhead. In this work, we design an efficient compressor with minimal overhead. Noting the sparsity of the gradients, we propose to model the gradients as random variables distributed according to some sparsity-inducing distributions (SIDs). We empirically validate our assumption by studying the statistical characteristics of the evolution of gradient vectors over the training process. We then propose Sparsity-Inducing Distribution-based Compression (SIDCo), a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC) while being faster by imposing lower compression overhead. Our extensive evaluation of popular machine learning benchmarks involving both recurrent neural network (RNN) and convolution neural network (CNN) models shows that SIDCo speeds up training by up to 41.7×, 7.6×, and 1.9× compared to the no-compression baseline, Topk, and DGC compressors, respectively.

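Motivation and Objectives

Motivate and address the communication bottleneck in distributed DNN training.
Model gradients with sparsity-inducing distributions to enable efficient compression.
Develop a multi-stage, threshold-based compression scheme with low overhead.
Provide closed-form threshold estimators for SIDCo that achieve a target compression ratio.
Demonstrate speedups and training-efficiency gains on RNN and CNN benchmarks.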

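Proposed Method

Model gradients as random variables drawn from sparsity-inducing distributions (SIDs): double exponential, double gamma, and double generalized Pareto.
Derive the threshold that achieves a target compression ratio from the inverse CDF of the absolute-gradient distribution.
Propose a single-stage thresholding scheme and a multi-stage estimator that improves far-tail threshold accuracy under aggressive sparsification.
Use a multi-stage PoT (peaks over threshold) fitting approach, with corollaries for the exponential, gamma, and GP distributions, to refine the threshold.
Provide an adaptive SIDCo algorithm that selects the number of stages M to bound the estimation error.
Analyze convergence, showing that under bounded compression discrepancy SIDCo matches the convergence rate of SGD.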
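
As a worked example of the inverse-CDF step, consider the double exponential case, where the absolute gradient |G| is exponential with scale β. The notation here is chosen for illustration and may differ from the paper's: g_i are the d gradient elements, η is the threshold, δ is the target fraction of elements kept, and δ_m are per-stage ratios. The last line is a sketch of a multi-stage (peaks over threshold) refinement under an exponential fit, where \hat{\beta}_m is the mean of the exceedances |g_i| - \hat{\eta}_{m-1} over the elements with |g_i| > \hat{\eta}_{m-1}.

```latex
\begin{align*}
  \Pr(|G| > \eta) &= e^{-\eta/\beta} = \delta
    \quad\Longrightarrow\quad \eta = \beta \ln\frac{1}{\delta}, \\
  \hat{\beta} &= \frac{1}{d}\sum_{i=1}^{d} |g_i|,
    \qquad \hat{\eta} = \hat{\beta}\,\ln\frac{1}{\delta}, \\
  \hat{\eta}_m &= \hat{\eta}_{m-1} + \hat{\beta}_m \ln\frac{1}{\delta_m},
    \qquad \hat{\eta}_0 = 0, \quad \prod_{m=1}^{M}\delta_m = \delta.
\end{align*}
```

For instance, with δ = 0.001 the single-stage estimate is about 6.91 times the mean absolute gradient, since ln(1000) ≈ 6.91. Only one pass over the gradient is needed to compute the mean, with no sorting.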

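Experimental Results

Research Questions

RQ1: How can gradient compression be achieved with minimal computational overhead while preserving convergence?
RQ2: Can gradient distributions be modeled effectively with sparsity-inducing distributions to enable accurate threshold estimation?
RQ3: Does a multi-stage threshold estimator improve threshold accuracy under aggressive sparsification across architectures?
RQ4: What convergence guarantees does SIDCo have under threshold-based sparsification?
RQ5: What practical speedups and quality trade-offs does SIDCo achieve on standard benchmarks?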

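Key Findings

SIDCo speeds up training by up to roughly 41.7×, 7.6×, and 1.9× relative to the no-compression baseline, Topk, and DGC compressors, respectively.
In both GPU and CPU settings, single-stage thresholding with SIDs achieves compression ratios close to the target at lower overhead than Topk or DGC.
Multi-stage threshold estimation improves tail accuracy under extreme sparsification (very small δ).
Under bounded discrepancy, SIDCo matches the convergence rate of SGD, meaning asymptotic convergence behavior is not degraded.
Experiments on RNN and CNN benchmarks show consistent performance gains and accurate threshold estimation across models.
SIDCo's complexity is linear in model size, which makes parallel GPU implementations scalable.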
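
The findings on near-target compression ratios and multi-stage tail accuracy can be explored with a minimal sketch on synthetic data. This is not the authors' implementation: the function names (`exp_threshold`, `multi_stage_threshold`, `sparsify`) are hypothetical, only the exponential fit is shown, and splitting δ into equal per-stage ratios is an assumption made for simplicity. Every step is a mean or an elementwise comparison, so the cost is linear in the number of gradient elements.

```python
import numpy as np


def exp_threshold(abs_vals, delta):
    """Threshold eta with P(|g| > eta) ~= delta under an exponential fit."""
    scale = abs_vals.mean()  # maximum-likelihood scale of an exponential
    return scale * np.log(1.0 / delta)


def multi_stage_threshold(grad, delta, stages):
    """Peaks-over-threshold refinement: refit the exceedances at each stage.

    Assumption: every stage uses the same ratio delta ** (1 / stages),
    so the per-stage ratios multiply to the overall target delta.
    """
    abs_vals = np.abs(grad)
    per_stage = delta ** (1.0 / stages)
    eta = 0.0
    for _ in range(stages):
        excess = abs_vals[abs_vals > eta] - eta
        if excess.size == 0:  # nothing left above the current threshold
            break
        eta += exp_threshold(excess, per_stage)
    return eta


def sparsify(grad, eta):
    """Keep only elements whose magnitude exceeds eta (indices and values)."""
    idx = np.flatnonzero(np.abs(grad) > eta)
    return idx, grad[idx]
```

On gradients that really are double exponential (for example, samples from `numpy.random.Generator.laplace`), the single-stage threshold should keep close to a fraction δ of the elements. On a heavier-tailed input, such as a mixture of two Laplace scales, the single-stage fit tends to keep far too many elements at small δ, while a few stages land much closer to the target, which is the behavior the multi-stage finding describes.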

This review was generated by AI and reviewed by human editors.