QUICK REVIEW

[Paper Review] Understanding Top-k Sparsification in Distributed Deep Learning

Shaohuai Shi, Xiaowen Chu|arXiv (Cornell University)|Nov 20, 2019
Stochastic Gradient Optimization Techniques|References: 32|Citations: 67
One-line summary

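This paper analyzes Top-k gradient sparsification with error compensation in distributed SGD, derives a tighter bound for the Top-k operator under bell-shaped gradient distributions, and proposes Gaussian-k, an approximate Top-k method that speeds up GPU computation while preserving convergence.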

ABSTRACT

Distributed stochastic gradient descent (SGD) algorithms are widely deployed in training large-scale deep learning models, while the communication overhead among workers becomes the new system bottleneck. Recently proposed gradient sparsification techniques, especially Top-$k$ sparsification with error compensation (TopK-SGD), can significantly reduce the communication traffic without an obvious impact on the model accuracy. Some theoretical studies have been carried out to analyze the convergence property of TopK-SGD. However, existing studies do not dive into the details of Top-$k$ operator in gradient sparsification and use relaxed bounds (e.g., exact bound of Random-$k$) for analysis; hence the derived results cannot well describe the real convergence performance of TopK-SGD. To this end, we first study the gradient distributions of TopK-SGD during the training process through extensive experiments. We then theoretically derive a tighter bound for the Top-$k$ operator. Finally, we exploit the property of gradient distribution to propose an approximate top-$k$ selection algorithm, which is computing-efficient for GPUs, to improve the scaling efficiency of TopK-SGD by significantly reducing the computing overhead. Codes are available at: https://github.com/hclhkbu/GaussianK-SGD.
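To make "Top-k sparsification with error compensation" concrete, here is a minimal single-worker sketch in NumPy. It is an illustration under assumptions, not the authors' implementation: the function names `top_k` and `sparsify_with_error_compensation` are hypothetical, and a random vector stands in for a real local gradient.

```python
import numpy as np

def top_k(u: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude entries of u and zero out the rest."""
    out = np.zeros_like(u)
    idx = np.argpartition(np.abs(u), -k)[-k:]  # indices of the k largest |u_i|
    out[idx] = u[idx]
    return out

def sparsify_with_error_compensation(grad, residual, k):
    """One worker-side step: returns (sparse gradient to send, new residual)."""
    acc = grad + residual         # add back what was not sent in earlier steps
    sparse = top_k(acc, k)        # only these k values would be communicated
    return sparse, acc - sparse   # the unsent part is carried to the next step

rng = np.random.default_rng(0)
d, k = 1000, 10
residual = np.zeros(d)
for step in range(3):
    grad = rng.standard_normal(d)  # assumed stand-in for a local stochastic gradient
    sparse, residual = sparsify_with_error_compensation(grad, residual, k)
    print(step, np.count_nonzero(sparse))
```

The residual never discards information: at every step the sent part plus the new residual equals the gradient plus the old residual, which is the property the error-compensation analysis relies on.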

Motivation and objectives

  • Investigate why TopK-SGD converges well in practice despite conservative theoretical bounds.
  • Characterize gradient distributions during TopK-SGD training across diverse models and tasks.
  • Derive a tighter contraction bound for the Top-k operator than existing k/d bounds.
  • Propose an efficient approximate top-k selection algorithm that preserves convergence.
  • Demonstrate end-to-end training speedups using the proposed Gaussian_k method on GPU clusters.

Proposed method

  • Empirically study local stochastic gradient coordinates and observe bell-shaped distributions across multiple models and tasks.
  • Derive a tighter bound: ||u - Top_k(u)||^2 <= (1 - k/d)^2 ||u||^2, assuming u follows a bell-shaped distribution and π^2 (the sorted squared gradient magnitudes) is convex.
  • Convert the bound to a practical delta parameter: δ = (2kd - k^2)/d^2 for convergence analysis.
  • Propose Gaussian_k: an approximation of Top_k that exploits the Gaussian-like gradient distribution to estimate a selection threshold efficiently on GPUs.
  • Benchmark Gaussian_k against Top_k, DGC_k, and Trimmed_topk in terms of computation time and scaling.
  • Validate convergence of GaussianK-SGD on CIFAR10 and ImageNet, comparing accuracy to TopK-SGD and Dense-SGD.
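The idea behind Gaussian_k can be sketched as threshold selection: if the coordinates are roughly zero-mean Gaussian, a threshold that keeps about k of d values follows from the Gaussian quantile function, so no sort or exact top-k search is needed. The sketch below is a simplified CPU illustration under that assumption. The function name `gaussian_threshold_select` is hypothetical, and any threshold refinement the authors' released code may perform is not reproduced here.

```python
import numpy as np
from statistics import NormalDist

def gaussian_threshold_select(u: np.ndarray, k: int):
    """Approximate top-k selection by thresholding |u|.

    Assumes the entries of u are roughly zero-mean Gaussian. Returns the
    (indices, values) of the selected coordinates; the number selected is
    only approximately k.
    """
    d = u.size
    sigma = float(np.sqrt(np.mean(u ** 2)))  # scale estimate under zero mean
    # Two-sided tail: P(|X| > t) = k/d  =>  t = sigma * z_{1 - k/(2d)}
    z = NormalDist().inv_cdf(1.0 - k / (2.0 * d))
    thres = sigma * z
    idx = np.flatnonzero(np.abs(u) > thres)
    return idx, u[idx]

rng = np.random.default_rng(0)
d, k = 1_000_000, 1_000
u = 0.01 * rng.standard_normal(d)  # assumed stand-in for a gradient vector
idx, vals = gaussian_threshold_select(u, k)
print(f"requested k={k}, selected {idx.size}")
```

The selection is a single elementwise comparison plus one reduction for the scale estimate, which is the kind of operation that maps well to GPUs. The price is that the selected count only approximates k, and the approximation degrades when the distribution is far from Gaussian.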

Experimental results

Research questions

  • RQ1: Why does TopK-SGD converge nearly as well as Dense-SGD despite weaker general sparsification bounds?
  • RQ2: Do gradient coordinate distributions during training support a tighter Top-k contraction bound than 1 - k/d?
  • RQ3: Can an approximate Top-k operator aligned with Gaussian-like gradients accelerate GPU computation without sacrificing convergence?
  • RQ4: What are the end-to-end training speedups when adopting Gaussian_k on large-scale datasets and GPUs?
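For RQ2, it helps to put the two contraction factors side by side. The comparison below assumes the usual compressor form ||u - C(u)||^2 <= (1 - δ) ||u||^2, which is the form implied by the δ formula listed under the proposed method; the labels δ_old and δ_new are introduced here only for illustration.

```latex
\begin{align*}
\text{previous bound:}\quad
  \|u - \mathrm{Top}_k(u)\|^2 &\le \Bigl(1 - \tfrac{k}{d}\Bigr)\,\|u\|^2,
  & \delta_{\text{old}} &= \frac{k}{d},\\[4pt]
\text{tighter bound:}\quad
  \|u - \mathrm{Top}_k(u)\|^2 &\le \Bigl(1 - \tfrac{k}{d}\Bigr)^{2}\|u\|^2,
  & \delta_{\text{new}} &= 1 - \Bigl(1 - \tfrac{k}{d}\Bigr)^{2}
                         = \frac{2kd - k^2}{d^2}.
\end{align*}
```

Because 0 <= 1 - k/d <= 1, squaring can only shrink the factor, so δ_new >= δ_old for every 1 <= k <= d. For example, with k/d = 0.001 the previous bound gives δ = 0.001 while the tighter one gives δ = 0.001999, roughly double.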

Key findings

  • TopK-SGD achieves convergence close to Dense-SGD across multiple models, while RandK-SGD can fail to converge on datasets like ImageNet.
  • Gradient coordinates under TopK-SGD exhibit bell-shaped (Gaussian-like) distributions with many near-zero values, enabling tighter analysis.
  • A theoretical bound using (1 - k/d)^2 yields a tighter contraction than previous (1 - k/d) bounds, explaining faster practical convergence of TopK-SGD.
  • Gaussian_k provides a GPU-friendly approximate top-k selection with comparable convergence to TopK-SGD and significantly faster end-to-end training.
  • GaussianK-SGD achieves up to 2.33x, 3.63x, and 1.51x speedups over Dense-SGD, TopK-SGD, and DGC-SGD respectively on a 16-GPU cluster with 10GbE.
  • End-to-end experiments show GaussianK-SGD maintains near TopK-SGD accuracy on CIFAR10 and ImageNet while delivering substantial speedups.
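The gap between Top-k and Rand-k in these findings can be checked numerically on synthetic data. The sketch below draws a standard Gaussian vector as an assumed stand-in for a bell-shaped gradient and compares the residual energy left by Top-k and by Rand-k with the two bounds. It illustrates the claim on one synthetic sample and does not reproduce the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 100_000, 1_000
u = rng.standard_normal(d)   # assumed bell-shaped stand-in for a gradient
total = np.sum(u ** 2)

# Top-k residual: energy left in the d-k smallest-magnitude coordinates.
sq_sorted = np.sort(u ** 2)
topk_ratio = sq_sorted[: d - k].sum() / total

# Rand-k residual: keep a uniformly random set of k coordinates instead.
keep = rng.choice(d, size=k, replace=False)
randk_ratio = 1.0 - np.sum(u[keep] ** 2) / total

loose = 1.0 - k / d          # classical bound, exact in expectation for Rand-k
tight = (1.0 - k / d) ** 2   # tighter Top-k bound reported in the paper

print(f"Top-k residual ratio : {topk_ratio:.4f}")
print(f"Rand-k residual ratio: {randk_ratio:.4f}")
print(f"(1 - k/d)^2 = {tight:.4f}, 1 - k/d = {loose:.4f}")
```

On Gaussian data the Top-k residual ratio falls below (1 - k/d)^2, while the Rand-k ratio stays close to 1 - k/d, which is consistent with Rand-k based bounds being too pessimistic for TopK-SGD.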


This review was generated by AI and checked by a human editor.