QUICK REVIEW

[Paper Review] Weighted AdaGrad with Unified Momentum

Fangyu Zou, Li Shen|arXiv (Cornell University)|Aug 10, 2018

Stochastic Gradient Optimization Techniques | References: 11 | Cited by: 41

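One-Sentence Summary

This paper proposes AdaUSM, a unified adaptive stochastic optimization method that combines a generalized momentum scheme with a weighted adaptive learning rate and, with polynomially growing weights, achieves an O(log(T)/√T) convergence rate in the non-convex stochastic setting. Through polynomial and exponential weighting schemes, it brings Adam, RMSProp, AdaGrad, and AccAdaGrad under a single framework and offers theoretical insight into their behavior.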

ABSTRACT

Integrating adaptive learning rate and momentum techniques into SGD leads to a large class of efficiently accelerated adaptive stochastic algorithms, such as Nadam, AccAdaGrad, \textit{etc}. In spite of their effectiveness in practice, there is still a large gap in their theories of convergences, especially in the difficult non-convex stochastic setting. To fill this gap, we propose \emph{weighted AdaGrad with unified momentum}, dubbed AdaUSM, which has the main characteristics that (1) it incorporates a unified momentum scheme which covers both the heavy ball momentum and the Nesterov accelerated gradient momentum; (2) it adopts a novel weighted adaptive learning rate that can unify the learning rates of AdaGrad, AccAdaGrad, Adam, and RMSProp. Moreover, when we take polynomially growing weights in AdaUSM, we obtain its $\mathcal{O}(\log(T)/\sqrt{T})$ convergence rate in the non-convex stochastic setting. We also show that the adaptive learning rates of Adam and RMSProp correspond to taking exponentially growing weights in AdaUSM, which thereby provides a new perspective for understanding Adam and RMSProp. Lastly, comparative experiments of AdaUSM against SGD with momentum, AdaGrad, AdaEMA, Adam, and AMSGrad on various deep learning models and datasets are also provided.

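Research Motivation and Objectives

Close the gap in the convergence theory of adaptive stochastic optimization methods in the non-convex setting.
Unify existing adaptive methods (such as Adam, RMSProp, AdaGrad, and AccAdaGrad) within a single optimization framework.
Propose a new weighted adaptive learning rate that generalizes existing adaptive learning-rate schemes.
Provide a theoretical convergence-rate guarantee for the proposed method in non-convex stochastic optimization.
Offer a new perspective on Adam and RMSProp by viewing them through exponentially growing weights within the unified framework.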

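Proposed Method

Proposes a unified momentum scheme that covers both heavy ball momentum and Nesterov accelerated gradient momentum.
Introduces a weighted adaptive learning rate that generalizes the learning rates of AdaGrad, AccAdaGrad, Adam, and RMSProp.
Uses polynomially growing weights in the adaptive learning rate to obtain an O(log(T)/√T) convergence rate in the non-convex stochastic setting.
Shows by derivation that the adaptive learning rates of Adam and RMSProp correspond to exponentially growing weights in the proposed framework.
Designs a single optimization algorithm, AdaUSM, that combines momentum with the adaptive learning rate, with the weights acting as a parameter.
Analyzes convergence under standard assumptions, including assumptions on bounded gradients and on the stochastic gradients.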
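
The update described in the list above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' reference implementation: the function name `weighted_adagrad_usm_step`, the parameter names (`lr`, `mu`, `lam`, `eps`), the normalization of the accumulator by the running average weight, and the `lam` interpolation between heavy ball (`lam = 0`) and Nesterov-style momentum (`lam = 1`) are all choices made here for illustration, and the paper's exact parameterization may differ.

```python
import math


def weighted_adagrad_usm_step(x, m, v, weight_sum, grad, t, a_t,
                              lr=0.1, mu=0.9, lam=1.0, eps=1e-8):
    """One illustrative update on lists of floats (hypothetical helper).

    x, m, v    : parameters, momentum buffer, weighted sum of squared gradients
    weight_sum : running sum a_1 + ... + a_{t-1} of the weights seen so far
    t          : 1-based step counter
    a_t        : weight assigned to the current squared gradient
    """
    weight_sum += a_t
    a_bar = weight_sum / t  # average weight up to step t (assumed normalization)
    new_x, new_m, new_v = [], [], []
    for xi, mi, vi, gi in zip(x, m, v, grad):
        vi = vi + a_t * gi * gi                    # weighted accumulation of g^2
        step = lr / (math.sqrt(vi / a_bar) + eps)  # per-coordinate step size
        mi_new = mu * mi - step * gi               # momentum buffer update
        # lam = 0: heavy ball; lam = 1: Nesterov-style look-ahead correction
        xi_new = xi + mi_new + lam * mu * (mi_new - mi)
        new_x.append(xi_new)
        new_m.append(mi_new)
        new_v.append(vi)
    return new_x, new_m, new_v, weight_sum


def run_demo(steps=50, power=1.0):
    """Run the sketch on f(x) = 0.5 * ||x||^2 with polynomial weights a_t = t**power."""
    x, m, v, weight_sum = [3.0, -2.0], [0.0, 0.0], [0.0, 0.0], 0.0
    for t in range(1, steps + 1):
        grad = list(x)  # exact gradient of the toy objective, no noise added
        x, m, v, weight_sum = weighted_adagrad_usm_step(
            x, m, v, weight_sum, grad, t, a_t=float(t) ** power)
    return x
```

With `a_t = 1` at every step and `mu = 0`, the step reduces to a plain AdaGrad update, which is a quick sanity check for the sketch.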

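Experimental Results

Research Questions

RQ1: Can a unified optimization framework be designed that integrates adaptive learning rates with generalized momentum?
RQ2: What theoretical convergence rate can such a unified method achieve in non-convex stochastic optimization?
RQ3: How do existing methods such as Adam and RMSProp relate to the unified framework?
RQ4: How do different weight growth patterns (polynomial vs. exponential) affect convergence and performance?
RQ5: Does the proposed method outperform existing adaptive stochastic optimizers, or provide stronger theoretical support for them?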

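Key Findings

With polynomially growing weights, AdaUSM achieves an O(log(T)/√T) convergence rate in non-convex stochastic optimization, matching the best known rates for existing adaptive methods.
The adaptive learning rate in AdaUSM generalizes the learning rates of AdaGrad, AccAdaGrad, Adam, and RMSProp through a single weighted formula.
Adam and RMSProp correspond to exponentially growing weights in AdaUSM, which gives a new theoretical explanation of their behavior.
In experiments across several deep learning models and datasets, AdaUSM performs as well as or better than SGD with momentum, AdaGrad, AdaEMA, Adam, and AMSGrad.
The unified momentum scheme in AdaUSM recovers heavy ball and Nesterov momentum as special cases.
Polynomial weight growth in AdaUSM yields stronger theoretical convergence guarantees without sacrificing practical performance.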
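
The link between exponentially growing weights and Adam/RMSProp can be checked numerically. The sketch below assumes weights of the form a_i = beta^(-i), one concrete choice of exponential growth made here for illustration, and compares the resulting weighted average of squared gradients with Adam's bias-corrected exponential moving average. The helper names are hypothetical, and the naive powers of `beta` are only suitable for short sequences.

```python
import math


def weighted_average_sq(grads, beta):
    """Weighted average of squared gradients with weights a_i = beta**(-i)."""
    num, den = 0.0, 0.0
    for i, g in enumerate(grads, start=1):
        a_i = beta ** (-i)  # exponentially growing weight (illustrative choice)
        num += a_i * g * g
        den += a_i
    return num / den


def adam_second_moment(grads, beta):
    """Adam-style second-moment estimate with bias correction."""
    v = 0.0
    for g in grads:
        v = beta * v + (1.0 - beta) * g * g  # exponential moving average of g^2
    return v / (1.0 - beta ** len(grads))    # bias correction


grads = [0.5, -1.0, 2.0, 0.1]  # made-up scalar gradients for the check
same = math.isclose(weighted_average_sq(grads, 0.9),
                    adam_second_moment(grads, 0.9), rel_tol=1e-9)
```

The two quantities agree because multiplying the numerator and denominator of the weighted average by beta^t turns the weights beta^(-i) into beta^(t-i), whose sum is (1 - beta^t)/(1 - beta).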


This review was generated by AI and reviewed by human editors.