QUICK REVIEW

[Paper Review] On the Convergence of AdaGrad with Momentum for Training Deep Neural Networks

Fangyu Zou, Li Shen|arXiv (Cornell University)|Aug 10, 2018

Stochastic Gradient Optimization Techniques | Cited by 17

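One-Sentence Summary

This paper proposes AdaUSM, a unified adaptive stochastic optimization method that combines a weighted adaptive learning rate with a unified momentum scheme covering both heavy-ball and Nesterov momentum. With polynomially growing weights, it attains an O(log(T)/√T) convergence rate in the non-convex stochastic setting, and it offers theoretical insight into Adam and RMSProp by showing that their adaptive learning rates correspond to exponentially growing weights.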

ABSTRACT

Integrating adaptive learning rate and momentum techniques into SGD leads to a large class of efficiently accelerated adaptive stochastic algorithms, such as Nadam, AccAdaGrad, \textit{etc}. In spite of their effectiveness in practice, there is still a large gap in their theories of convergences, especially in the difficult non-convex stochastic setting. To fill this gap, we propose \emph{weighted AdaGrad with unified momentum}, dubbed AdaUSM, which has the main characteristics that (1) it incorporates a unified momentum scheme which covers both the heavy ball momentum and the Nesterov accelerated gradient momentum; (2) it adopts a novel weighted adaptive learning rate that can unify the learning rates of AdaGrad, AccAdaGrad, Adam, and RMSProp. Moreover, when we take polynomially growing weights in AdaUSM, we obtain its $\mathcal{O}(\log(T)/\sqrt{T})$ convergence rate in the non-convex stochastic setting. We also show that the adaptive learning rates of Adam and RMSProp correspond to taking exponentially growing weights in AdaUSM, which thereby provides a new perspective for understanding Adam and RMSProp. Lastly, comparative experiments of AdaUSM against SGD with momentum, AdaGrad, AdaEMA, Adam, and AMSGrad on various deep learning models and datasets are also provided.

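Research Motivation and Goals

Close the theoretical gap in the convergence analysis of adaptive stochastic optimization methods in the non-convex deep learning setting.
Unify existing adaptive methods (such as AdaGrad, Adam, and RMSProp) under a single framework and introduce a unified momentum scheme.
Design a weighted adaptive learning rate mechanism that generalizes the learning rate schedules of the major adaptive optimizers.
Establish a theoretical convergence rate for the proposed method in non-convex stochastic optimization.
Offer a new theoretical interpretation of Adam and RMSProp through the weighted adaptive learning rate.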

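Proposed Method

Propose AdaUSM, a unified optimization framework that integrates a weighted adaptive learning rate with a unified momentum scheme.
Introduce a unified momentum formulation that contains heavy-ball and Nesterov accelerated gradient momentum as special cases.
Design a weighted adaptive learning rate that generalizes the learning rates of AdaGrad, AccAdaGrad, Adam, and RMSProp.
Use polynomially growing weights in AdaUSM to derive an O(log(T)/√T) convergence rate in the non-convex stochastic setting.
Show that Adam and RMSProp correspond to exponentially growing weights in AdaUSM, which gives a new theoretical perspective on these methods.
Implement and evaluate AdaUSM on a range of deep learning models and datasets, comparing it against SGD with momentum, AdaGrad, AdaEMA, Adam, and AMSGrad.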
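
The ingredients listed above can be illustrated with a small pure-Python sketch. This review does not reproduce the paper's update rules, so every formula here is an assumption made for illustration: the weights are taken as a_t = t^alpha, the step size as eta / sqrt(t * weighted average of squared gradients), and the momentum as an interpolation controlled by a hypothetical parameter `lam` (0 gives a heavy-ball step, 1 gives a Nesterov-style step). The function and parameter names are hypothetical, and the paper's exact algorithm may differ.

```python
import math


def adausm_like_step(x, m, v, a_sum, grad, t,
                     eta=0.1, mu=0.9, lam=0.0, alpha=1.0, eps=1e-8):
    """One coordinate-wise step of an AdaUSM-style update (illustrative sketch).

    x, m, v, grad are lists of floats; t is the 1-based iteration index.
    All formulas are assumptions, not taken from the review text.
    """
    a_t = t ** alpha               # assumed polynomially growing weight
    a_sum = a_sum + a_t            # running sum of weights
    x_new, m_new, v_new = [], [], []
    for xi, mi, vi, gi in zip(x, m, v, grad):
        vi_next = vi + a_t * gi * gi                 # weighted sum of g^2
        # AdaGrad-style scaling by t times the weighted average of g^2
        eta_t = eta / (math.sqrt(t * vi_next / a_sum) + eps)
        mi_next = mu * mi - eta_t * gi               # momentum buffer
        # lam = 0: heavy ball; lam = 1: Nesterov-style correction
        xi_next = xi + mi_next + lam * mu * (mi_next - mi)
        x_new.append(xi_next)
        m_new.append(mi_next)
        v_new.append(vi_next)
    return x_new, m_new, v_new, a_sum


def minimize_quadratic(lam, steps=200):
    """Run the sketch on f(x) = 0.5 * (x0^2 + 10 * x1^2) with exact gradients."""
    curv = [1.0, 10.0]
    x = [1.0, -1.0]
    m = [0.0, 0.0]
    v = [0.0, 0.0]
    a_sum = 0.0
    for t in range(1, steps + 1):
        grad = [c * xi for c, xi in zip(curv, x)]
        x, m, v, a_sum = adausm_like_step(x, m, v, a_sum, grad, t, lam=lam)
    return 0.5 * sum(c * xi * xi for c, xi in zip(curv, x))


if __name__ == "__main__":
    print("heavy-ball style final loss:", minimize_quadratic(lam=0.0))
    print("Nesterov style final loss:  ", minimize_quadratic(lam=1.0))
```

Setting `alpha=0` makes all weights equal, in which case the assumed step size reduces to the familiar AdaGrad form eta / sqrt(sum of squared gradients). The toy quadratic uses exact gradients, so it only checks that the update is well behaved and says nothing about the stochastic convergence rate.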

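Experimental Results

Research Questions

RQ1: Can a unified optimization framework be developed that integrates both an adaptive learning rate and a unified momentum scheme?
RQ2: What is the theoretical convergence rate of the proposed method in the non-convex stochastic optimization setting?
RQ3: How do existing methods such as Adam and RMSProp relate to the proposed framework under specific weight growth patterns?
RQ4: Does the proposed method outperform state-of-the-art adaptive optimizers in convergence or generalization?
RQ5: Can the theoretical behavior of Adam and RMSProp be better understood through the lens of the weighted adaptive learning rate?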

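Key Findings

With polynomially growing weights, AdaUSM achieves an O(log(T)/√T) convergence rate in the non-convex stochastic setting.
The adaptive learning rates of Adam and RMSProp are shown to be special cases of AdaUSM with exponentially growing weights.
The unified momentum scheme in AdaUSM contains heavy-ball and Nesterov momentum as limiting cases.
Comparative experiments indicate that AdaUSM performs on par with SGD with momentum, AdaGrad, AdaEMA, Adam, and AMSGrad across multiple deep learning models and datasets.
The theoretical framework offers a new perspective on the practical behavior and convergence of Adam and RMSProp.
The weighted adaptive learning rate mechanism generalizes the learning rate schedules of several adaptive optimization methods.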
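
The link between exponentially growing weights and Adam can be checked numerically. The sketch below assumes weights of the form a_i = beta^(-i) (an illustrative choice; the review does not state the paper's exact weights) and compares the resulting weighted average of squared gradients with the standard bias-corrected exponential moving average used in Adam's second-moment estimate. The gradient values and function names are made up for the example.

```python
def weighted_average_sq(grads, weights):
    """Weighted average of squared gradients: sum(a_i * g_i^2) / sum(a_i)."""
    num = sum(a * g * g for a, g in zip(weights, grads))
    return num / sum(weights)


def adam_second_moment(grads, beta):
    """Bias-corrected exponential moving average of squared gradients."""
    v = 0.0
    for g in grads:
        v = beta * v + (1.0 - beta) * g * g
    return v / (1.0 - beta ** len(grads))


# Made-up scalar gradients, for illustration only.
grads = [0.5, -1.2, 0.3, 2.0, -0.7]
beta = 0.9

# Exponentially growing weights a_i = beta^(-i), i = 1..t (assumed form).
weights = [beta ** (-i) for i in range(1, len(grads) + 1)]

lhs = weighted_average_sq(grads, weights)
rhs = adam_second_moment(grads, beta)

if __name__ == "__main__":
    print(lhs, rhs)  # the two values should agree up to rounding error
```

The agreement follows from factoring beta^(-t) out of both the numerator and the denominator of the weighted average. In practice the moving-average form is preferable, since beta^(-i) overflows for large i.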

This review was generated by AI and checked by human editors.