
[Paper Review] Asynchronous Stochastic Gradient Descent with Delay Compensation

Shuxin Zheng, Qi Meng | arXiv (Cornell University) | Sep 27, 2016
Advanced Neural Network Applications | References: 26 | Cited by: 160
One-Sentence Summary

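This paper proposes Delay Compensated ASGD (DC-ASGD), a method that uses a Taylor expansion together with a cheap Hessian approximation to compensate for delayed gradients in asynchronous SGD, achieving convergence close to that of sequential SGD while keeping the efficiency of ASGD.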

ABSTRACT

With the fast development of deep learning, it has become common to learn big neural networks using massive training data. Asynchronous Stochastic Gradient Descent (ASGD) is widely adopted to fulfill this task for its efficiency, which is, however, known to suffer from the problem of delayed gradients. That is, when a local worker adds its gradient to the global model, the global model may have been updated by other workers and this gradient becomes "delayed". We propose a novel technology to compensate this delay, so as to make the optimization behavior of ASGD closer to that of sequential SGD. This is achieved by leveraging Taylor expansion of the gradient function and efficient approximation to the Hessian matrix of the loss function. We call the new algorithm Delay Compensated ASGD (DC-ASGD). We evaluated the proposed algorithm on CIFAR-10 and ImageNet datasets, and the experimental results demonstrate that DC-ASGD outperforms both synchronous SGD and asynchronous SGD, and nearly approaches the performance of sequential SGD.

Research Motivation and Objectives

  • Motivate and address the problem of delayed gradients in ASGD for training deep neural networks.
  • Develop a delay compensation mechanism based on Taylor expansion and a scalable Hessian approximation.
  • Propose an implementable DC-ASGD algorithm with a diagonal Hessian approximation and analyze its convergence.
  • Empirically validate DC-ASGD on CIFAR-10 and ImageNet against ASGD, SSGD, and sequential SGD.
  • Demonstrate improved convergence speed and accuracy close to sequential SGD while maintaining ASGD efficiency.

Proposed Method

  • Formulate gradient delay in ASGD and show, via Taylor expansion, that using the delayed gradient as-is amounts to a zeroth-order approximation of the gradient at the current global model.
  • Use an inexpensive Hessian approximation based on the outer product of gradients and a diagonalization trick to reduce storage (Diag(λG)).
  • Derive a delay-compensated gradient g(w_t) + λ g(w_t) ⊙ g(w_t) ⊙ (w_{t+τ} − w_t) and update the global model accordingly (Eq. 10).
  • Propose two implementation variants: DC-ASGD-c (constant λ) and DC-ASGD-a (adaptive λ via MeanSquare tracking).
  • Provide convergence theory for non-convex neural nets under bounded delay with ergodic rate O(1/√T) and discuss delay tolerance.
  • Experimentally evaluate on CIFAR-10 (ResNet-20) and ImageNet (ResNet-50), comparing DC-ASGD to ASGD, SSGD, and sequential SGD.
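The compensated update described in the list above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function and class names (`dc_gradient`, `AdaptiveLambda`, `server_step`) are hypothetical, the default values for `lam0`, `momentum`, and `eps` are placeholders rather than the paper's settings, and the adaptive rule (dividing a base λ by the root of a running mean square of gradients) is an assumed reading of "adaptive λ via MeanSquare tracking".

```python
import numpy as np


def dc_gradient(g, w_now, w_backup, lam):
    """Return g + lam * g * g * (w_now - w_backup), all element-wise.

    g        : gradient the worker computed at its stale copy w_backup
    w_now    : current global parameters (already updated by other workers)
    w_backup : parameters the worker pulled before computing g
    lam      : scalar or per-coordinate array scaling the compensation term
    """
    return g + lam * g * g * (w_now - w_backup)


class AdaptiveLambda:
    """Per-coordinate lam from a running mean square of gradients (assumed form)."""

    def __init__(self, dim, lam0=1.0, momentum=0.9, eps=1e-7):
        # lam0, momentum and eps are illustrative placeholders.
        self.mean_square = np.zeros(dim)
        self.lam0 = lam0
        self.momentum = momentum
        self.eps = eps

    def update(self, g):
        m = self.momentum
        self.mean_square = m * self.mean_square + (1.0 - m) * g * g
        return self.lam0 / np.sqrt(self.mean_square + self.eps)


def server_step(w_now, w_backup, g, lr, lam):
    """Apply one delay-compensated update to the global parameters."""
    return w_now - lr * dc_gradient(g, w_now, w_backup, lam)


if __name__ == "__main__":
    w_backup = np.array([0.0, 0.0])  # what the worker pulled
    w_now = np.array([0.5, 0.5])     # global model after other workers' updates
    g = np.array([1.0, -2.0])        # stale gradient from the worker

    # Constant-lam variant (DC-ASGD-c style).
    print(server_step(w_now, w_backup, g, lr=0.1, lam=0.1))

    # Adaptive-lam variant (DC-ASGD-a style).
    adaptive = AdaptiveLambda(dim=2)
    print(server_step(w_now, w_backup, g, lr=0.1, lam=adaptive.update(g)))
```

When the worker's copy is not stale (`w_now` equals `w_backup`), the compensation term vanishes and the update reduces to a plain SGD step. The only extra state the server needs is one backup of the parameters per worker, which is why the diagonal form keeps storage linear in the model size.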

Experimental Results

Research Questions

  • RQ1: Can delayed gradients in ASGD be effectively compensated without sacrificing the speed advantages of asynchronous updates?
  • RQ2: How well does a Taylor-based delay compensation coupled with a Hessian approximation perform in non-convex neural networks under bounded delay?
  • RQ3: Does DC-ASGD offer superior convergence speed and final accuracy compared to ASGD and SSGD, approaching sequential SGD?
  • RQ4: What is the impact of different λ settings (constant vs adaptive) on stability, variance, and performance?
  • RQ5: How does DC-ASGD scale to large datasets like ImageNet with many workers?

Key Findings

  • DC-ASGD outperforms both ASGD and SSGD in convergence speed and final accuracy on CIFAR-10 across different numbers of workers.
  • For CIFAR-10 with 4 workers, DC-ASGD-c achieves 8.67% error and DC-ASGD-a achieves 8.19%. Both beat ASGD (9.27%) and SSGD (9.17%); DC-ASGD-a also falls below sequential SGD (8.65%), while DC-ASGD-c roughly matches it.
  • With 8 workers, DC-ASGD-a attains 8.57% error, outperforming DC-ASGD-c (9.27%), ASGD (10.26%), and SSGD (10.10%).
  • On ImageNet, DC-ASGD-a with 16 workers achieves 25.18% top-1 error, better than ASGD (25.64%) and SSGD (25.30%), while maintaining wall-clock efficiency similar to ASGD.
  • Theoretical results show DC-ASGD has an ergodic convergence rate of O(V/√T) under bounded delay, and can outperform ASGD under suitable λ and delay conditions.
  • Adaptive λ variant (DC-ASGD-a) generally yields stronger empirical performance than the constant λ variant (DC-ASGD-c).
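To make the CIFAR-10 comparison easier to read, the error rates quoted in the findings above can be tabulated against the sequential SGD baseline. The snippet below is only a convenience sketch: the numbers are copied from the bullets above, and the helper name `gap_to_sequential` is hypothetical.

```python
# Test error (%) on CIFAR-10, as quoted in the findings above.
SEQUENTIAL_SGD = 8.65

cifar10_error = {
    4: {"ASGD": 9.27, "SSGD": 9.17, "DC-ASGD-c": 8.67, "DC-ASGD-a": 8.19},
    8: {"ASGD": 10.26, "SSGD": 10.10, "DC-ASGD-c": 9.27, "DC-ASGD-a": 8.57},
}


def gap_to_sequential(error):
    """Error gap in percentage points; negative means better than sequential SGD."""
    return round(error - SEQUENTIAL_SGD, 2)


if __name__ == "__main__":
    for workers, results in cifar10_error.items():
        for method, error in results.items():
            gap = gap_to_sequential(error)
            print(f"{workers} workers  {method:<10} {error:5.2f}%  gap {gap:+.2f}")
```

Read this way, DC-ASGD-a is the only configuration whose gap is negative (at both 4 and 8 workers), and every method drifts further from the sequential baseline as the worker count grows from 4 to 8.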


This review was generated by AI and checked by human editors.