QUICK REVIEW

[Paper Review] TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning

Wei Wen, Cong Xu|arXiv (Cornell University)|May 22, 2017

Robotics and Automated Systems|Cited by 419

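One-Sentence Summary

TernGrad quantizes gradients to the three levels {-1, 0, 1} to reduce communication in distributed data-parallel training, backs this with a convergence guarantee, and adds layer-wise ternarizing and gradient clipping to improve convergence; experiments show little or no accuracy loss and significant speedups.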

ABSTRACT

High network communication cost for synchronizing gradients and parameters is the well-known bottleneck of distributed training. In this work, we propose TernGrad that uses ternary gradients to accelerate distributed deep learning in data parallelism. Our approach requires only three numerical levels {-1,0,1}, which can aggressively reduce the communication time. We mathematically prove the convergence of TernGrad under the assumption of a bound on gradients. Guided by the bound, we propose layer-wise ternarizing and gradient clipping to improve its convergence. Our experiments show that applying TernGrad on AlexNet does not incur any accuracy loss and can even improve accuracy. The accuracy loss of GoogLeNet induced by TernGrad is less than 2% on average. Finally, a performance model is proposed to study the scalability of TernGrad. Experiments show significant speed gains for various deep neural networks. Our source code is available.

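Motivation and Objectives

Address the communication bottleneck of distributed SGD in data-parallel deep learning.
Propose a ternary gradient quantization method to replace full-precision gradient synchronization.
Establish a theoretically grounded bound, and practical techniques derived from it, to ensure convergence and stability.
Show empirically that accuracy is preserved (or improved) on standard deep neural networks, and measure scalability and speedup.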

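Proposed Method

Quantize gradients to ternary values using a stochastic Bernoulli mask whose probabilities are driven by the gradient magnitudes.
Apply a shared scalar s_t to scale the ternary values so that the quantized gradient is an unbiased estimate of the original.
Use parameter localization so that workers pull quantized gradients instead of synchronizing full-precision parameters from the server.
Introduce layer-wise ternarizing and gradient clipping to tighten the convergence bound and shrink the gradient range.
Provide a convergence analysis showing almost sure convergence under standard online gradient descent conditions and a bound on the gradients.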
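
The quantization steps listed above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation. The function names `clip_gradient`, `ternarize`, and `ternarize_layerwise` are hypothetical, gradients are represented as flat lists of floats, the scalar s_t is assumed to be the maximum absolute gradient value of each tensor, and the clipping factor `c=2.5` is an assumed default.

```python
import random
import statistics


def clip_gradient(grad, c=2.5):
    """Clip every element to [-c*sigma, c*sigma], where sigma is the
    population standard deviation of this gradient (assumed clipping rule)."""
    bound = c * statistics.pstdev(grad)
    if bound == 0.0:
        return list(grad)  # all elements equal: nothing sensible to clip
    return [max(-bound, min(bound, g)) for g in grad]


def ternarize(grad, rng):
    """Map grad to (ternary, s) with ternary values in {-1, 0, 1}.

    Each element keeps its sign with probability |g| / s and becomes 0
    otherwise, so the expectation of s * ternary[i] equals grad[i].
    """
    s = max(abs(g) for g in grad)
    if s == 0.0:
        return [0] * len(grad), 0.0
    ternary = []
    for g in grad:
        keep = rng.random() < abs(g) / s  # Bernoulli mask
        sign = 1 if g > 0 else -1
        ternary.append(sign if keep else 0)
    return ternary, s


def ternarize_layerwise(layer_grads, rng, c=2.5):
    """Clip and ternarize each layer separately, one scalar per layer."""
    return [ternarize(clip_gradient(g, c), rng) for g in layer_grads]


if __name__ == "__main__":
    rng = random.Random(0)
    layers = [[0.5, -1.0, 0.0, 0.25], [0.01, -0.02, 0.03]]
    for ternary, s in ternarize_layerwise(layers, rng):
        print(s, ternary)
```

Giving each layer its own scalar keeps one layer with unusually large gradients from forcing most elements of the other layers to zero, which is the intuition behind quantizing layer by layer.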

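Experimental Results

Research Questions

RQ1: Can ternary gradient quantization guarantee the convergence of distributed SGD?
RQ2: How do layer-wise ternarizing and gradient clipping affect convergence and practical performance?
RQ3: What accuracy and speedup can TernGrad achieve on standard convolutional neural network architectures?
RQ4: How does TernGrad scale with the number of workers and the network bandwidth?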

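Key Findings

Under the stated assumptions and with the ternary gradient estimate, TernGrad converges almost surely toward a minimum.
Layer-wise ternarizing and gradient clipping tighten the convergence bound and improve stability in practice.
AlexNet trained with TernGrad shows no accuracy loss and can even improve; GoogLeNet loses less than 2% Top-1 accuracy on average.
Experiments show significant training speedups from the reduced communication, especially for networks with a high communication-to-computation ratio.
A performance model indicates considerable throughput gains on multi-GPU clusters across different bandwidths.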
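
To make the communication saving concrete, the sketch below packs ternary values four per byte. The 2-bit encoding and the function names `pack_ternary` and `unpack_ternary` are assumptions chosen for illustration, not a format described in this review. Compared with 32-bit floats, this encoding shrinks the gradient payload by a factor of 16, not counting the scalar sent alongside each tensor.

```python
def pack_ternary(values):
    """Pack values from {-1, 0, 1} into bytes, four values per byte.

    Assumed 2-bit codes: 0 -> 0b00, 1 -> 0b01, -1 -> 0b10.
    """
    codes = {0: 0, 1: 1, -1: 2}
    out = bytearray((len(values) + 3) // 4)
    for i, v in enumerate(values):
        out[i // 4] |= codes[v] << (2 * (i % 4))
    return bytes(out)


def unpack_ternary(data, n):
    """Recover the first n ternary values from packed bytes."""
    decode = (0, 1, -1, 0)  # code 0b11 is unused and decodes to 0
    return [decode[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)]


if __name__ == "__main__":
    ternary = [1, 0, -1, 0, 0, 1, 1, -1, 0, 0, 0, -1, 1, 0, 0, 0]
    packed = pack_ternary(ternary)
    full_precision_bytes = 4 * len(ternary)  # 32-bit floats
    print(len(packed), full_precision_bytes)
    assert unpack_ternary(packed, len(ternary)) == ternary
```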

This review was generated by AI and checked by a human editor.