QUICK REVIEW

[Paper Review] Adaptive Gradient Methods with Dynamic Bound of Learning Rate

Liangchen Luo, Yuanhao Xiong|arXiv (Cornell University)|Feb 26, 2019

Stochastic Gradient Optimization Techniques | Cited by 189

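One-Sentence Summary

This paper proposes AdaBound and AMSBound, dynamic-bound variants of Adam and AMSGrad that start out as adaptive optimizers and gradually transition to SGD, with convergence guarantees and improved generalization across multiple tasks.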

ABSTRACT

Adaptive optimization methods such as AdaGrad, RMSprop and Adam have been proposed to achieve a rapid training process with an element-wise scaling term on learning rates. Though prevailing, they are observed to generalize poorly compared with SGD or even fail to converge due to unstable and extreme learning rates. Recent work has put forward some algorithms such as AMSGrad to tackle this issue but they failed to achieve considerable improvement over existing methods. In our paper, we demonstrate that extreme learning rates can lead to poor performance. We provide new variants of Adam and AMSGrad, called AdaBound and AMSBound respectively, which employ dynamic bounds on learning rates to achieve a gradual and smooth transition from adaptive methods to SGD and give a theoretical proof of convergence. We further conduct experiments on various popular tasks and models, which is often insufficient in previous work. Experimental results show that new variants can eliminate the generalization gap between adaptive methods and SGD and maintain higher learning speed early in training at the same time. Moreover, they can bring significant improvement over their prototypes, especially on complex deep networks. The implementation of the algorithm can be found at https://github.com/Luolc/AdaBound .

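Research Motivation and Objectives

Motivation: explain the limitations of adaptive optimizers such as Adam/AMSGrad with respect to generalization and convergence.
Propose a mechanism that bounds the learning rate so that the optimizer transitions from adaptive behavior to SGD.
Provide theoretical convergence guarantees for the new methods in the convex setting.
Demonstrate empirical gains on computer vision and natural language processing tasks across a variety of architectures.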

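Proposed Method

Implement AdaBound by clipping the per-parameter learning rate between a lower and an upper bound that evolve over time and converge to a final step size.
Define eta_l(t) and eta_u(t) so that the method transforms gradually from Adam/AMSGrad into SGD(M).
Prove regret bounds and convergence properties for AdaBound (and AMSBound) under convexity assumptions.
Compare AdaBound/AMSBound with SGD(M), AdaGrad, Adam, and AMSGrad through experiments on MNIST, CIFAR-10, and Penn Treebank.
Provide implementation details and discuss hyperparameter choices and bound schedules.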
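
The clipping idea described above can be sketched in a few lines of plain Python for a single scalar parameter. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `bound_schedule` and `adabound_step`, the defaults `final_lr=0.1` and `gamma=1e-3`, the exact form of the bound schedule, and the Adam-style bias correction are all choices made here for the example and are not specified in this review. The only property the sketch relies on is the one stated in the text: the lower and upper bounds start far apart and both converge to a final step size, so the update behaves like Adam early and like SGD with momentum late.

```python
import math


def bound_schedule(step, final_lr=0.1, gamma=1e-3):
    """Assumed schedule: the lower bound rises and the upper bound falls toward final_lr."""
    lower = final_lr * (1.0 - 1.0 / (gamma * step + 1.0))
    upper = final_lr * (1.0 + 1.0 / (gamma * step))
    return lower, upper


def adabound_step(x, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                  eps=1e-8, final_lr=0.1, gamma=1e-3):
    """One AdaBound-style update for a scalar parameter x (illustrative only)."""
    state["t"] += 1
    t = state["t"]

    # Adam-style moving averages of the gradient and the squared gradient.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad * grad

    # Adaptive learning rate with bias correction folded in (an assumption here).
    step_size = lr * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    adaptive_lr = step_size / (math.sqrt(state["v"]) + eps)

    # Clip the adaptive rate into the time-dependent interval [lower, upper].
    lower, upper = bound_schedule(t, final_lr, gamma)
    clipped_lr = min(max(adaptive_lr, lower), upper)

    return x - clipped_lr * state["m"]


def demo(steps=2000):
    """Minimize f(x) = x^2 starting from x = 5 and return the final x."""
    x = 5.0
    state = {"t": 0, "m": 0.0, "v": 0.0}
    for _ in range(steps):
        grad = 2.0 * x  # derivative of x^2
        x = adabound_step(x, grad, state)
    return x
```

Calling `demo()` drives `x` toward the minimizer at 0. Printing `bound_schedule(t)` for increasing `t` shows the interval narrowing around `final_lr`, which is the point at which the clipped rate no longer depends on the gradient history.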

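Experimental Results

Research Questions

RQ1: Can a dynamically bounded learning rate schedule prevent extreme updates and improve the generalization of adaptive optimizers?
RQ2: Do AdaBound and AMSBound retain fast initial convergence while achieving SGD-like generalization?
RQ3: In the convex setting, what theoretical guarantees (convergence/regret) do these bound-based adaptive methods have?
RQ4: How do the proposed methods perform against baseline optimizers across diverse architectures and tasks (computer vision and natural language processing)?
RQ5: Is there a practical, tunable bound schedule that works well without extensive hyperparameter tuning?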

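Key Findings

AdaBound/AMSBound achieve fast early training similar to adaptive methods and, at convergence, generalization comparable to or better than SGD(M).
The dynamic bounds ensure a smooth transition from adaptive behavior to SGD, mitigating the problems caused by extreme learning rates.
The theoretical analysis gives a convergence guarantee under convexity in the form of an O(sqrt(T)) regret bound.
Experiments on MNIST, CIFAR-10, and Penn Treebank show test accuracy and perplexity that improve on Adam/AMSGrad and are competitive with SGD(M).
Complex models (DenseNet, ResNet, multi-layer LSTM) show larger gains, highlighting the advantage on deeper architectures.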
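
For readers unfamiliar with the regret criterion behind the O(sqrt(T)) finding above, the standard online convex optimization definition is given below. The notation (loss functions f_t, iterates x_t, feasible set \mathcal{F}) is assumed here for illustration and is not taken from this review.

```latex
R_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{F}} \sum_{t=1}^{T} f_t(x),
\qquad
R_T = O(\sqrt{T})
\;\Longrightarrow\;
\frac{R_T}{T} = O\!\left(\frac{1}{\sqrt{T}}\right) \to 0
\quad (T \to \infty).
```

In words, a regret that grows only like sqrt(T) means the average per-step loss of the optimizer approaches that of the best fixed point chosen in hindsight.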


This review was generated by AI and reviewed by human editors.