QUICK REVIEW

[Paper Review] Understanding Top-k Sparsification in Distributed Deep Learning

Shaohuai Shi, Xiaowen Chu|arXiv (Cornell University)|Nov 20, 2019

Stochastic Gradient Optimization Techniques | References: 32 | Cited by: 67

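One-Sentence Summary

This paper analyzes Top-k sparsification with error compensation in distributed SGD, derives a tighter bound for the Top-k operator when gradients follow a bell-shaped distribution, and proposes Gaussian_k, an approximate Top-k method that speeds up GPU computation while preserving convergence.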
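To make the phrase "Top-k sparsification with error compensation" concrete, here is a minimal single-worker sketch. It uses plain Python lists in place of GPU tensors, and the function names `top_k_sparsify` and `compensated_step` are hypothetical, chosen for illustration rather than taken from the paper's code.

```python
import heapq


def top_k_sparsify(vec, k):
    """Keep the k largest-magnitude entries of vec and zero out the rest."""
    keep = set(heapq.nlargest(k, range(len(vec)), key=lambda i: abs(vec[i])))
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def compensated_step(grad, residual, k):
    """Add the stored residual, sparsify, and keep what was not sent."""
    acc = [g + r for g, r in zip(grad, residual)]
    sparse = top_k_sparsify(acc, k)  # the part a worker would communicate
    new_residual = [a - s for a, s in zip(acc, sparse)]  # carried to the next step
    return sparse, new_residual


grad = [0.5, -0.1, 0.05, -2.0, 0.3]
residual = [0.0] * len(grad)
sparse, residual = compensated_step(grad, residual, k=2)
print(sparse)    # [0.5, 0.0, 0.0, -2.0, 0.0]
print(residual)  # [0.0, -0.1, 0.05, 0.0, 0.3]
```

The dropped coordinates are not lost: they stay in `residual` and are added back to the next gradient before the next selection.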

ABSTRACT

Distributed stochastic gradient descent (SGD) algorithms are widely deployed in training large-scale deep learning models, while the communication overhead among workers becomes the new system bottleneck. Recently proposed gradient sparsification techniques, especially Top-$k$ sparsification with error compensation (TopK-SGD), can significantly reduce the communication traffic without an obvious impact on the model accuracy. Some theoretical studies have been carried out to analyze the convergence property of TopK-SGD. However, existing studies do not dive into the details of Top-$k$ operator in gradient sparsification and use relaxed bounds (e.g., exact bound of Random-$k$) for analysis; hence the derived results cannot well describe the real convergence performance of TopK-SGD. To this end, we first study the gradient distributions of TopK-SGD during the training process through extensive experiments. We then theoretically derive a tighter bound for the Top-$k$ operator. Finally, we exploit the property of gradient distribution to propose an approximate top-$k$ selection algorithm, which is computing-efficient for GPUs, to improve the scaling efficiency of TopK-SGD by significantly reducing the computing overhead. Codes are available at: https://github.com/hclhkbu/GaussianK-SGD.

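Motivation and Objectives

Investigate why TopK-SGD converges well in practice even though existing theoretical bounds are conservative.
Characterize the gradient distributions that arise during TopK-SGD training across different models and tasks.
Derive a contraction bound for the Top-k operator that is tighter than the existing k/d-based bound.
Propose an efficient approximate Top-k selection algorithm that preserves convergence.
Demonstrate end-to-end training speedups of the proposed Gaussian_k method on a GPU cluster.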

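Proposed Method

Empirically study the coordinates of local stochastic gradients and observe bell-shaped distributions across a range of models and tasks.
Assuming a bell-shaped gradient distribution and a convex π^2 (the squared, normalized gradient magnitudes sorted in descending order), derive the tighter bound ||u - Top_k(u)||^2 <= (1 - k/d)^2 ||u||^2.
Translate this bound into a δ parameter usable in the convergence analysis: δ = 1 - (1 - k/d)^2 = (2kd - k^2)/d^2.
Propose Gaussian_k, which exploits the Gaussian-like gradient distribution to approximate Top_k so that the selection threshold can be computed efficiently on GPUs.
Benchmark Gaussian_k against Top_k, DGC_k, and Trimmed_topk in terms of computation time and scalability.
Validate the convergence of GaussianK-SGD on CIFAR10 and ImageNet, comparing its accuracy with TopK-SGD and Dense-SGD.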
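The threshold idea behind Gaussian_k can be sketched in a few lines. This is an illustrative approximation, not the authors' implementation: it assumes the entries are roughly Gaussian with a mean near zero, uses a two-sided normal quantile as the cut, and omits any threshold refinement. The function name `gaussian_k_select` is hypothetical, and plain Python lists stand in for GPU tensors.

```python
import random
from statistics import NormalDist, fmean, pstdev


def gaussian_k_select(vec, k):
    """Approximate Top-k: pick entries beyond a fitted normal quantile.

    Assumption for this sketch: entries are roughly Gaussian, so about k
    of them fall outside the band that leaves a tail mass of k/(2d) on
    each side of the mean.
    """
    d = len(vec)
    mu, sigma = fmean(vec), pstdev(vec)
    thres = sigma * NormalDist().inv_cdf(1.0 - k / (2.0 * d))
    # One pass over the data, no sorting required.
    return [i for i, v in enumerate(vec) if abs(v - mu) > thres]


random.seed(0)
d, k = 20_000, 200
grad = [random.gauss(0.0, 0.01) for _ in range(d)]
picked = gaussian_k_select(grad, k)
print(len(picked))  # close to k, but generally not exactly k
```

The selection needs only a mean, a standard deviation, and an elementwise comparison, which is why this style of selection suits GPUs better than an exact sort-based Top-k. The price is that the number of selected entries only approximates k.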

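Experimental Results

Research Questions

RQ1: Why does TopK-SGD converge almost as well as Dense-SGD even though the general sparsification bound is weak?
RQ2: Do the gradient coordinate distributions observed during training support a Top-k contraction bound tighter than 1 - k/d?
RQ3: Can an approximate Top-k operator matched to Gaussian-like gradients speed up GPU computation without sacrificing convergence?
RQ4: How much end-to-end training speedup does Gaussian_k deliver on large-scale datasets and GPUs?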
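The quantity behind RQ2 is easy to probe numerically. The sketch below draws synthetic Gaussian samples (an assumption standing in for real gradients, which the paper measures on actual models) and compares the measured ratio ||u - Top_k(u)||^2 / ||u||^2 with the two bounds. The helper name `residual_ratio` is hypothetical.

```python
import heapq
import random


def residual_ratio(vec, k):
    """Return ||u - Top_k(u)||^2 / ||u||^2 for the vector u = vec."""
    squares = [v * v for v in vec]
    kept = sum(heapq.nlargest(k, squares))  # energy captured by Top-k
    total = sum(squares)
    return (total - kept) / total


random.seed(1)
d = 10_000
u = [random.gauss(0.0, 1.0) for _ in range(d)]

for k in (10, 100, 1000):
    ratio = residual_ratio(u, k)
    loose = 1 - k / d          # the 1 - k/d bound
    tight = (1 - k / d) ** 2   # the bound stated for bell-shaped gradients
    print(f"k={k:5d}  ratio={ratio:.4f}  tight={tight:.4f}  loose={loose:.4f}")
```

On such samples the measured ratio sits below both bounds, because the few largest-magnitude coordinates carry a disproportionate share of the squared norm.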

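Key Findings

Across a range of models, TopK-SGD converges almost as well as Dense-SGD, whereas RandK-SGD may fail to converge on datasets such as ImageNet.
Under TopK-SGD, gradient coordinates follow a bell-shaped (Gaussian-like) distribution with many near-zero values, which permits a tighter analysis.
The (1 - k/d)^2 bound is tighter than the earlier (1 - k/d) bound, which helps explain why TopK-SGD converges faster in practice than prior theory predicts.
Gaussian_k provides a GPU-friendly approximate Top-k selection whose convergence is comparable to TopK-SGD while end-to-end training is substantially faster.
On a 16-GPU cluster with 10GbE, GaussianK-SGD is up to 2.33x, 3.63x, and 1.51x faster than Dense-SGD, TopK-SGD, and DGC-SGD, respectively.
End-to-end experiments show that GaussianK-SGD stays close to TopK-SGD in accuracy on CIFAR10 and ImageNet while delivering a significant speedup.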
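The gap between the two bounds can be quantified with a short expansion. This is a sketch that assumes the usual contraction form ||u - Top_k(u)||^2 <= (1 - δ) ||u||^2, in which a larger δ means a stronger guarantee:

```latex
\begin{align*}
\|u - \mathrm{Top}_k(u)\|^2 &\le \left(1 - \tfrac{k}{d}\right)^2 \|u\|^2 = (1 - \delta)\,\|u\|^2, \\
\delta &= 1 - \left(1 - \tfrac{k}{d}\right)^2 = \frac{2k}{d} - \frac{k^2}{d^2} = \frac{2kd - k^2}{d^2}, \\
\frac{\delta}{k/d} &= 2 - \frac{k}{d}.
\end{align*}
```

For k much smaller than d, the δ obtained from the (1 - k/d)^2 bound is therefore close to twice the k/d value implied by the (1 - k/d) bound.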


This review was generated by AI and checked by a human editor.