QUICK REVIEW

[Paper Review] Variance-based Gradient Compression for Efficient Distributed Deep Learning

Yusuke Tsuzuku, Hiroto Imachi|arXiv (Cornell University)|Feb 16, 2018

Advanced Neural Network Applications | References: 15 | Cited by: 50

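ONE-SENTENCE SUMMARY

Introduces variance-based gradient compression, which delays ambiguous (low-amplitude, high-variance) gradient updates to sharply reduce communication in distributed training, achieving high compression at comparable accuracy while remaining compatible with other compression methods.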

ABSTRACT

Due to the substantial computational cost, training state-of-the-art deep neural networks for large-scale datasets often requires distributed training using multiple computation workers. However, by nature, workers need to frequently communicate gradients, causing severe bottlenecks, especially on lower bandwidth connections. A few methods have been proposed to compress gradient for efficient communication, but they either suffer a low compression ratio or significantly harm the resulting model accuracy, particularly when applied to convolutional neural networks. To address these issues, we propose a method to reduce the communication overhead of distributed deep learning. Our key observation is that gradient updates can be delayed until an unambiguous (high amplitude, low variance) gradient has been calculated. We also present an efficient algorithm to compute the variance with negligible additional cost. We experimentally show that our method can achieve very high compression ratio while maintaining the result model accuracy. We also analyze the efficiency using computation and communication cost models and provide the evidence that this method enables distributed deep learning for many scenarios with commodity environments.

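MOTIVATION AND OBJECTIVES

Motivate and address the communication bottleneck in data-parallel distributed deep learning.
Propose a gradient compression method that uses gradient variance to decide when an update is sent.
Achieve a high compression ratio without sacrificing model accuracy, and show compatibility with other compression schemes.
Provide analysis and empirical results on CIFAR-10 and ImageNet to demonstrate practicality on commodity networks.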

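PROPOSED METHOD

Delay sending ambiguous gradient components (those with a low signal-to-noise ratio) according to a variance-based criterion.
Use the threshold criterion alpha' / |B| * V_B[∇_i f_z(x)] < (∇_i f_B(x))^2 to decide whether a gradient component is sent.
Maintain running sums of the gradients and of the squared gradients so that the criterion can be evaluated efficiently at negligible additional cost.
Quantize each sent component to 4 bits (1 sign bit and 3 exponent bits) and encode the parameter indices to enable sparse communication.
Use allgatherv for sparse gradient communication to avoid repeated encoding and decoding during allreduce.
Optionally combine with Strom's sparsification method or with QSGD for further compression (hybrid methods).
Provide an efficient implementation by deriving a practical form of the criterion together with an update-and-decay mechanism (zeta) for the variance estimate.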
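The send-or-delay rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class name `VarianceGate`, the accumulator names `r` and `v`, the use of the uncentered sum of squares as a stand-in for the variance term, and the choice to decay only `v` for delayed entries are all assumptions made for the example.

```python
class VarianceGate:
    """Decides, per parameter, whether an accumulated gradient is sent or delayed."""

    def __init__(self, num_params, alpha=1.0, zeta=0.999):
        self.alpha = alpha  # threshold multiplier (alpha' in the criterion)
        self.zeta = zeta    # assumed decay factor applied to delayed entries
        self.r = [0.0] * num_params  # running sum of mini-batch mean gradients
        self.v = [0.0] * num_params  # running sum of squared gradients / |B|^2

    def step(self, per_sample_grads):
        """per_sample_grads: one list of per-parameter gradients per sample."""
        batch_size = len(per_sample_grads)
        sent = {}
        for i in range(len(self.r)):
            column = [sample[i] for sample in per_sample_grads]
            self.r[i] += sum(column) / batch_size
            # Uncentered stand-in for (1 / |B|) * V_B[grad_i]: sum of squares / |B|^2.
            self.v[i] += sum(x * x for x in column) / batch_size ** 2
            if self.r[i] ** 2 > self.alpha * self.v[i]:
                # Unambiguous component: send it and restart both accumulators.
                sent[i] = self.r[i]
                self.r[i] = 0.0
                self.v[i] = 0.0
            else:
                # Ambiguous component: keep accumulating r, decay the variance term.
                self.v[i] *= self.zeta
        return sent


gate = VarianceGate(num_params=2, alpha=1.0)
# Parameter 0 agrees across samples; parameter 1 flips sign from sample to sample.
grads = [[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]]
print(gate.step(grads))
```

In this toy batch only parameter 0 passes the test, because its squared mean exceeds the scaled second moment; parameter 1 stays in the accumulators until later batches make it unambiguous.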

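EXPERIMENTAL RESULTS

Research questions

RQ1: How far can a variance-based criterion reduce gradient communication in distributed deep learning?
RQ2: On large-scale tasks such as ImageNet, can variance-based gradient compression achieve high compression while maintaining accuracy?
RQ3: How does the proposed method interact with and complement existing compression techniques (quantization, sparsification)?
RQ4: What are the practical computation and communication costs of variance-based gradient compression on commodity hardware?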

Key findings

Method	Accuracy (%)	Compression ratio
Adam, no compression	88.1	1
Adam, Strom, tau=0.001	62.8	88.5
Adam, Strom, tau=0.01	85.0	230.1
Adam, Strom, tau=0.1	88.0	6,942.8
Adam, ours, alpha=1	88.9	120.7
Adam, ours, alpha=1.5	88.9	453.3
Adam, ours, alpha=2.0	88.9	913.4
Adam, hybrid, tau=0.01, alpha=2.0	85.0	1,942.2
Adam, hybrid, tau=0.1, alpha=2.0	88.2	12,822.4
Adam, QSGD (2bit, d=128)	88.8	12.3
Adam, QSGD (3bit, d=512)	87.4	14.4
Adam, QSGD (4bit, d=512)	88.2	11.0
Momentum SGD, no compression	91.7	1
Momentum SGD, Strom, tau=0.001	84.8	6.6
Momentum SGD, Strom, tau=0.01	10.6	990.7
Momentum SGD, Strom, tau=0.1	71.6	8,485.0
Momentum SGD, ours, alpha=1	?	?
Momentum SGD, ours, alpha=1.5	?	?
Momentum SGD, ours, alpha=2.0	?	?
Momentum SGD, hybrid, tau=0.01, alpha=2.0	87.6	983.9
Momentum SGD, hybrid, tau=0.1, alpha=2.0	87.1	12,396.8
Momentum SGD, QSGD (2bit, d=128)	90.8	6.6
Momentum SGD, QSGD (3bit, d=512)	91.4	7.0
Momentum SGD, QSGD (4bit, d=512)	91.7	4.0

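With Adam on CIFAR-10, the method achieves very high compression ratios at comparable or slightly better accuracy; with Momentum SGD it also achieves strong compression.
On CIFAR-10 with Adam, our method reaches about 88.9% accuracy for alpha ∈ {1, 1.5, 2.0} while substantially reducing communication (at alpha=2.0, up to 913.4x with Adam and 383.6x with Momentum SGD).
The hybrid method (variance-based + Strom) achieves larger compression with little accuracy loss in several settings, outperforming Strom's method on its own.
On ImageNet (ResNet-50), the variance-based method comes close to the accuracy of the quantization-based methods while achieving substantial compression (for example, alpha=2.0 with Momentum SGD reaches 75.1% to 75.5% accuracy at compression ratios of 990.7x to 5,173.8x).
Variance-based compression makes scalable distributed training feasible on commodity interconnects, and allgatherv-based communication benefits from the high compression ratios.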
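To give a rough sense of what these ratios mean for traffic, the sketch below converts a compression ratio into bytes per training step. It is a back-of-the-envelope illustration, not the paper's cost model: it assumes the ratio is measured against raw 32-bit gradients (this summary does not state how the ratio is defined), it uses a hypothetical 1M-parameter model, and it ignores latency and protocol overhead. Both function names are invented for the example.

```python
def compressed_bytes_per_step(num_params, compression_ratio, bytes_per_value=4):
    """Payload one worker sends per step, assuming the ratio applies to raw bytes."""
    return num_params * bytes_per_value / compression_ratio


def allgather_bytes_received(num_workers, payload_bytes):
    """Bytes each worker receives if every other worker sends payload_bytes."""
    return (num_workers - 1) * payload_bytes


# Hypothetical 1M-parameter model; 913.4 is the Adam, alpha=2.0 ratio from the table.
payload = compressed_bytes_per_step(1_000_000, 913.4)
print(round(payload), round(allgather_bytes_received(8, payload)))
```

Because an allgather-style exchange delivers every other worker's payload to each worker, received volume grows with the number of workers, which is why a high compression ratio matters for this communication pattern.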

This review was generated by AI and checked by human editors.