QUICK REVIEW

[Paper Review] Natural Compression for Distributed Deep Learning

Samuel Horváth, Chen-Yu Ho|arXiv (Cornell University)|May 27, 2019

Stochastic Gradient Optimization Techniques|References: 44|Cited by: 69

One-Sentence Summary

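The paper introduces natural compression C_nat, which uses randomized rounding to round each entry of an update to a nearest power of two, achieving substantial communication savings with negligible impact on convergence, and extends it to natural dithering for more aggressive compression, an exponential improvement over standard dithering.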

ABSTRACT

Modern deep learning models are often trained in parallel over a collection of distributed machines to reduce training time. In such settings, communication of model updates among machines becomes a significant performance bottleneck and various lossy update compression techniques have been proposed to alleviate this problem. In this work, we introduce a new, simple yet theoretically and practically effective compression technique: natural compression (NC). Our technique is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a "natural" way by ignoring the mantissa. We show that compared to no compression, NC increases the second moment of the compressed vector by not more than the tiny factor $\frac{9}{8}$, which means that the effect of NC on the convergence speed of popular training algorithms, such as distributed SGD, is negligible. However, the communications savings enabled by NC are substantial, leading to $3$-$4\times$ improvement in overall theoretical running time. For applications requiring more aggressive compression, we generalize NC to natural dithering, which we prove is exponentially better than the common random dithering technique. Our compression operators can be used on their own or in combination with existing operators for a more aggressive combined effect and offer new state-of-the-art both in theory and practice.
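
The $\frac{9}{8}$ factor quoted in the abstract can be recovered with a short calculation. What follows is a sketch, assuming the operator rounds a scalar $t$ with $2^a \le |t| \le 2^{a+1}$ to one of the two neighboring powers of two, with probabilities chosen so that the result is unbiased. The symbols $a$, $u$, and $p$ are introduced here for illustration and are not taken from the paper.

```latex
\begin{align*}
|t| &= 2^{a}(1+u), \qquad u \in [0,1], \\
C_{\mathrm{nat}}(t) &=
  \begin{cases}
    \operatorname{sign}(t)\,2^{a}   & \text{with probability } p = 1-u, \\
    \operatorname{sign}(t)\,2^{a+1} & \text{with probability } 1-p = u,
  \end{cases} \\
\mathbb{E}\big[C_{\mathrm{nat}}(t)\big]
  &= \operatorname{sign}(t)\,2^{a}\big[(1-u) + 2u\big]
   = \operatorname{sign}(t)\,2^{a}(1+u) = t, \\
\mathbb{E}\big[C_{\mathrm{nat}}(t)^{2}\big]
  &= 2^{2a}\big[(1-u) + 4u\big] = 2^{2a}(1+3u), \\
\frac{\mathbb{E}\big[C_{\mathrm{nat}}(t)^{2}\big]}{t^{2}}
  &= \frac{1+3u}{(1+u)^{2}} \;\le\; \frac{9}{8},
  \qquad \text{with equality at } u = \tfrac{1}{3}.
\end{align*}
```

The maximum follows from setting the derivative of $(1+3u)/(1+u)^2$ to zero, which gives $(1+u)(1-3u) = 0$. Because the bound holds entry by entry, it carries over to the squared Euclidean norm of the whole update vector.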

Motivation and Objectives

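Address the communication bottleneck in data-parallel distributed deep learning.
Propose a simple, unbiased compression operator and prove that its variance is controlled.
Show that compression yields substantial communication savings without significantly slowing convergence.
Introduce natural dithering for more aggressive compression and analyze its theoretical benefits.
Demonstrate practical performance gains and compatibility with existing compression methods.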

Proposed Method

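Define and implement natural compression C_nat, which maps each real-valued update entry to a random power of two via unbiased rounding.
Prove that C_nat belongs to the class B(1/8) of unbiased operators with bounded second moment, ensuring a negligible effect on convergence (Theorem 2.3).
Show that natural compression reduces communication by encoding only the sign and exponent bits of the IEEE 754 format (3.56× fewer bits for float32, 5.82× fewer for float64).
Introduce natural dithering D_nat^{p,s} as an exponential improvement over standard dithering, and prove its variance and compression properties (Theorems 3.2 and 3.3).
Develop a bidirectional compression framework with master and worker nodes for distributed SGD (Algorithm 1), using compressors from B(ω) to obtain speedups (Theorem 4.1).
Prove compatibility with existing compression operators via a composition rule (Theorem 2.5).
Provide a proof-of-concept system and experiments that validate reductions in training time and scalability (ResNet110 and AlexNet on CIFAR-10; ImageNet results).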
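
The first item in the list above can be illustrated with a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the function name `natural_compression`, the use of `np.frexp` to find the lower power of two, and the handling of zeros are all assumptions made for illustration. It shows only the randomized rounding step, not the bit-level encoding of sign and exponent.

```python
import numpy as np


def natural_compression(x, rng):
    """Randomly round each entry of x to a neighboring signed power of two.

    Illustrative sketch only: zeros stay zero; subnormals, inf and NaN are
    not given any special treatment.
    """
    x = np.asarray(x, dtype=np.float64)
    out = np.zeros_like(x)
    nz = x != 0
    mag = np.abs(x[nz])
    # frexp writes mag = m * 2**e with m in [0.5, 1), so 2**(e-1) <= mag < 2**e.
    _, e = np.frexp(mag)
    low = np.ldexp(np.ones_like(mag), e - 1)  # lower neighboring power of two
    # Rounding up to 2*low with probability (mag - low) / low keeps the
    # expectation equal to mag, i.e. the operator is unbiased.
    p_up = (mag - low) / low
    round_up = rng.random(mag.shape) < p_up
    out[nz] = np.sign(x[nz]) * np.where(round_up, 2.0 * low, low)
    return out


rng = np.random.default_rng(0)
update = np.array([0.0, 1.0, -2.5, 0.3, 6.0])
print(natural_compression(update, rng))
```

Every nonzero output has a magnitude that is an exact power of two, so only its sign and exponent would need to be transmitted. Averaging many independent compressions of the same vector should recover the input, which is what unbiasedness means here.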

Experimental Results

Research Questions

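RQ1: By how much does natural compression increase the second moment of the update vector, and does it meaningfully affect convergence?
RQ2: Does bidirectional compression with C_nat and natural dithering deliver practical speedups in distributed SGD while preserving accuracy?
RQ3: What theoretical guarantees and practical benefits arise when natural compression is combined with existing compression techniques?
RQ4: Under a fixed communication budget, how does natural dithering compare with standard dithering in variance and efficiency?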

Key Findings

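C_nat increases the second moment by a factor of at most 9/8, with negligible impact on the convergence of SGD-based methods.
Under bidirectional compression, C_nat reduces per-iteration communication by roughly 3.2×–3.6×.
At the same variance level, natural dithering D_nat^{p,s} is exponentially better than standard dithering.
When combined with sparsification or other operators, natural compression delivers a larger overall speedup than standard methods (see the discussion of Table 1).
Empirical results show significant training-time reductions (e.g., about 26% for ResNet110 and about 66% for AlexNet on CIFAR-10) with no loss in final accuracy, and the approach scales successfully to larger workloads such as ImageNet.
The proposed operators are compatible with SwitchML-style in-network aggregation and support a broad class of compression operators within B(ω).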
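
The comparison between natural and standard dithering can be explored numerically. The sketch below rests on assumptions that this summary does not spell out: that both schemes normalize entries by the p-norm and round them without bias onto a grid in [0, 1], that standard dithering uses the uniform grid 0, 1/s, ..., 1, and that natural dithering uses the power-of-two grid 0, 2^(1-s), ..., 1/2, 1. The helper names `dither`, `standard_levels`, `natural_levels`, and `relative_error` are hypothetical.

```python
import numpy as np


def standard_levels(s):
    """Assumed uniform grid 0, 1/s, ..., 1 with s nonzero levels."""
    return np.linspace(0.0, 1.0, s + 1)


def natural_levels(s):
    """Assumed power-of-two grid 0, 2**(1-s), ..., 1/2, 1 with s nonzero levels."""
    return np.concatenate(([0.0], 2.0 ** np.arange(1 - s, 1)))


def dither(x, levels, rng, p=2):
    """Unbiased randomized rounding of |x_i| / ||x||_p onto a sorted grid."""
    x = np.asarray(x, dtype=np.float64)
    norm = np.linalg.norm(x, ord=p)
    if norm == 0:
        return np.zeros_like(x)
    y = np.abs(x) / norm  # normalized magnitudes in [0, 1]
    # Index of the upper neighboring level, clipped so a lower neighbor exists.
    hi_idx = np.clip(np.searchsorted(levels, y, side="left"), 1, len(levels) - 1)
    lo, hi = levels[hi_idx - 1], levels[hi_idx]
    p_up = (y - lo) / (hi - lo)  # chosen so the expected rounded value is y
    rounded = np.where(rng.random(y.shape) < p_up, hi, lo)
    return norm * np.sign(x) * rounded


def relative_error(x, levels, rng, trials=200):
    """Monte Carlo estimate of E||D(x) - x||^2 / ||x||^2."""
    errs = [np.sum((dither(x, levels, rng) - x) ** 2) for _ in range(trials)]
    return np.mean(errs) / np.sum(x ** 2)


rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
s = 8
print("natural dithering, s levels: ", relative_error(x, natural_levels(s), rng))
print("standard dithering, s levels:", relative_error(x, standard_levels(s), rng))
```

For a dense vector most normalized entries are small, so a grid that is fine near zero wastes fewer levels than a uniform one; this is the intuition the experiment is meant to illustrate. Changing the second call to `standard_levels(2 ** (s - 1))` gives a rough sense of how many uniform levels are needed to reach a similar error.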

This review was generated by AI and reviewed by a human editor.