QUICK REVIEW

[Paper Review] Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

Yujun Lin, Song Han|arXiv (Cornell University)|Dec 5, 2017

Advanced Neural Network Applications|References: 37|Cited by: 645

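One-Sentence Summary

Deep Gradient Compression combines momentum correction, local gradient clipping, momentum factor masking, and warm-up training to cut gradient communication volume by 270× to 600× while preserving the accuracy of CNNs and RNNs.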

ABSTRACT

Large-scale distributed training requires significant communication bandwidth for gradient exchange that limits the scalability of multi-node training, and requires expensive high-bandwidth network infrastructure. The situation gets even worse with distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections. In this paper, we find 99.9% of the gradient exchange in distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth. To preserve accuracy during compression, DGC employs four methods: momentum correction, local gradient clipping, momentum factor masking, and warm-up training. We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including Cifar10, ImageNet, Penn Treebank, and Librispeech Corpus. On these scenarios, Deep Gradient Compression achieves a gradient compression ratio from 270x to 600x without losing accuracy, cutting the gradient size of ResNet-50 from 97MB to 0.35MB, and for DeepSpeech from 488MB to 0.74MB. Deep gradient compression enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet and facilitates distributed training on mobile. Code is available at: https://github.com/synxlin/deep-gradient-compression.

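Motivation and Goals

Reduce the communication bandwidth that synchronous distributed SGD requires in large-scale training.
Propose a gradient compression method that preserves accuracy even at very high sparsity.
Introduce mechanisms that mitigate the convergence and staleness problems caused by sparse updates.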

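Proposed Method

Gradient sparsification: transmit only the large-magnitude gradients and accumulate the small ones locally.
Sparse encoding: encode the sparse gradient with 32-bit nonzero values and 16-bit run lengths of zeros.
Momentum correction: align the sparse updates with the dense momentum SGD update.
Local gradient clipping: clip gradients on each node to limit the risk of gradient explosion.
Momentum factor masking: reduce the influence of stale momentum carried by delayed gradients.
Warm-up training: increase sparsity gradually to stabilize the early phase of training.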
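The steps listed above can be sketched for a single worker as follows. This is a minimal NumPy illustration, not the authors' implementation: the names `clip_by_norm`, `dgc_step`, and `WARMUP_SPARSITY` are hypothetical, gradients are assumed to be flat float arrays, and no actual network exchange is performed. The warm-up values are assumed to follow the schedule reported in the paper (75% rising to 99.9%), and the clipping threshold is scaled by N^(-1/2) for N workers.

```python
import numpy as np

# Assumed warm-up schedule: sparsity used in each of the first warm-up epochs.
WARMUP_SPARSITY = [0.75, 0.9375, 0.984375, 0.996, 0.999]


def clip_by_norm(grad, max_norm, num_workers):
    """Local gradient clipping with the threshold scaled by N ** -0.5."""
    local_max = max_norm / np.sqrt(num_workers)
    norm = np.linalg.norm(grad)
    if norm > local_max:
        grad = grad * (local_max / norm)
    return grad


def dgc_step(grad, velocity, accum, sparsity, momentum=0.9):
    """One local step; updates the buffers in place and returns what to send.

    velocity: local momentum buffer (u in the usual momentum SGD notation)
    accum:    locally accumulated, not-yet-sent updates (v)
    """
    # Momentum correction: accumulate the velocity, not the raw gradient.
    velocity[:] = momentum * velocity + grad
    accum[:] += velocity

    # Sparsification: keep only the largest-magnitude (1 - sparsity) fraction.
    k = max(1, int(round(accum.size * (1.0 - sparsity))))
    idx = np.argpartition(np.abs(accum), -k)[-k:]
    values = accum[idx].copy()

    # Momentum factor masking: clear the sent entries in both buffers so
    # stale momentum does not keep pushing coordinates that were just sent.
    accum[idx] = 0.0
    velocity[idx] = 0.0
    return idx, values


# Usage: one step on a toy gradient during the first warm-up epoch.
grad = clip_by_norm(np.random.default_rng(0).standard_normal(1000), 1.0, 4)
velocity, accum = np.zeros(1000), np.zeros(1000)
indices, values = dgc_step(grad, velocity, accum, WARMUP_SPARSITY[0])
```

Entries that are not sent stay in `accum` and `velocity`, so small gradients are delayed rather than dropped.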

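Experimental Results

Research Questions

RQ1: Can the volume of gradient exchange be reduced by orders of magnitude across multiple tasks without losing accuracy?
RQ2: How can the convergence problems caused by sparsity be mitigated in distributed SGD with momentum?
RQ3: Which combination of techniques gives the best trade-off between bandwidth reduction and model performance on both CNNs and RNNs?
RQ4: Which runtime strategies (for example, hierarchical threshold selection) make sparse gradient selection scalable?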
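To make the idea behind RQ4 concrete, the sketch below estimates the selection threshold from a small random sample instead of running an exact top-k over the whole gradient, and refines only when too many entries pass. It is an illustrative assumption of how hierarchical threshold selection could look, not the paper's code: `sampled_threshold`, `select_sparse`, and the 1% default sample ratio are hypothetical choices.

```python
import numpy as np


def sampled_threshold(abs_grad, sparsity, sample_ratio=0.01, rng=None):
    """Estimate the magnitude threshold from a random sample of entries."""
    rng = np.random.default_rng() if rng is None else rng
    n_sample = max(1, int(abs_grad.size * sample_ratio))
    sample = abs_grad[rng.choice(abs_grad.size, size=n_sample, replace=False)]
    k = max(1, int(round(n_sample * (1.0 - sparsity))))
    # k-th largest magnitude within the sample.
    return np.partition(sample, -k)[-k]


def select_sparse(grad, sparsity, sample_ratio=0.01, rng=None):
    """Return indices of the entries to transmit (at most the top-k count)."""
    abs_grad = np.abs(grad)
    threshold = sampled_threshold(abs_grad, sparsity, sample_ratio, rng)
    idx = np.flatnonzero(abs_grad >= threshold)
    k = max(1, int(round(grad.size * (1.0 - sparsity))))
    if idx.size > k:
        # Second level: exact top-k, but only over the candidates.
        top = np.argpartition(abs_grad[idx], -k)[-k:]
        idx = idx[top]
    return idx


# Usage: pick roughly 0.1% of 100,000 entries.
rng = np.random.default_rng(0)
chosen = select_sparse(rng.standard_normal(100_000), sparsity=0.999, rng=rng)
```

The sampled threshold can also select fewer entries than the target count; this sketch accepts that, since unsent entries would simply remain in the local accumulation.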

Key Findings

Across all tasks and datasets, DGC achieves gradient compression ratios of roughly 270× to 600× without losing accuracy.
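On ImageNet, DGC compresses ResNet-50 gradients by 277× (97 MB to 0.35 MB) and AlexNet gradients by 597×; for AlexNet, accuracy is essentially unchanged from the baseline (Top-1: 58.17% baseline vs. 58.20% DGC; Top-5: 80.19% baseline vs. 80.20% DGC).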
On CIFAR-10 with ResNet-110 on 4 GPUs, baseline Top-1 accuracy is 93.75% and DGC reaches 93.87% (+0.12%).
On CIFAR-10 with ResNet-110 on 8 GPUs and a total batch size of 256, baseline Top-1 accuracy is 92.92% and DGC reaches 93.28% (+0.37%).
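On Penn Treebank language modeling, perplexity is 72.30 (baseline) vs. 72.24 (DGC), with a gradient size of 0.42 MB (462× compression).
On LibriSpeech speech recognition, word error rate (WER) on the test-clean set is 9.45% (baseline) vs. 9.06% (DGC), with a gradient size of 0.74 MB (608× compression).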
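As a rough sanity check on these ratios, the encoding described under the proposed method (32-bit nonzero values plus 16-bit run lengths of zeros) allows a back-of-the-envelope estimate. The sketch below assumes dense gradients are 32-bit floats, a final sparsity of 99.9%, exactly one run-length field per transmitted value, and no other overhead; the function names are hypothetical.

```python
def dense_bytes(num_params):
    """Size of a dense gradient stored as 32-bit floats."""
    return 4 * num_params


def sparse_bytes(num_params, sparsity):
    """Idealized size: a 32-bit value plus a 16-bit run length per nonzero."""
    nonzeros = round(num_params * (1.0 - sparsity))
    return nonzeros * (4 + 2)


def ideal_ratio(num_params, sparsity):
    """Upper-bound compression ratio under the assumptions above."""
    return dense_bytes(num_params) / sparse_bytes(num_params, sparsity)


# About 667x for one million parameters at 99.9% sparsity.
ratio = ideal_ratio(1_000_000, 0.999)
```

Under these assumptions the ratio is about 667×, so the reported 277× to 608× sit below this idealized bound, as expected for an estimate that ignores all practical overhead.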

This review was generated by AI and reviewed by human editors.