QUICK REVIEW

[Paper Review] An Adaptive and Momental Bound Method for Stochastic Learning

Jianbang Ding, Xuancheng Ren|arXiv (Cornell University)|Oct 27, 2019

Stochastic Gradient Optimization Techniques | References: 22 | Cited by: 28

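One-Sentence Summary

This paper proposes AdaMod, a new adaptive optimization method that stabilizes the training of deep neural networks by applying a momental, adaptive upper bound to Adam's learning rates. By smoothing the adaptive rates with an exponential moving average, it suppresses the excessively large learning rates that arise early in training, which removes the need for learning rate warmup and yields better convergence and generalization on complex architectures such as DenseNet and Transformer.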

ABSTRACT

Training deep neural networks requires intricate initialization and careful selection of learning rates. The emergence of stochastic gradient optimization methods that use adaptive learning rates based on squared past gradients, e.g., AdaGrad, AdaDelta, and Adam, eases the job slightly. However, such methods have also been proven problematic in recent studies with their own pitfalls including non-convergence issues and so on. Alternative variants have been proposed for enhancement, such as AMSGrad, AdaShift and AdaBound. In this work, we identify a new problem of adaptive learning rate methods that exhibits at the beginning of learning where Adam produces extremely large learning rates that inhibit the start of learning. We propose the Adaptive and Momental Bound (AdaMod) method to restrict the adaptive learning rates with adaptive and momental upper bounds. The dynamic learning rate bounds are based on the exponential moving averages of the adaptive learning rates themselves, which smooth out unexpected large learning rates and stabilize the training of deep neural networks. Our experiments verify that AdaMod eliminates the extremely large learning rates throughout the training and brings significant improvements especially on complex networks such as DenseNet and Transformer, compared to Adam. Our implementation is available at: https://github.com/lancopku/AdaMod

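Motivation and Objectives

Address the instability that adaptive optimizers such as Adam exhibit early in training because of excessively large learning rates.
Trace Adam's early-stage non-convergence and poor generalization to unstable, high-magnitude learning rates.
Develop a method that stabilizes the learning rate by keeping a long-term memory of past gradient information, without relying on heuristic warmup schedules.
Improve training stability and generalization across a range of deep learning models, with the largest gains on complex models such as Transformer and DenseNet.
Make optimization more robust to the choice of initial learning rate by reducing hyperparameter sensitivity, and remove the need to hand-tune warmup.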

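Proposed Method

Apply an exponential moving average (EMA) to the adaptive learning rates computed by Adam to build a smooth, momental upper bound.
Use the EMA of the adaptive rates as a dynamic upper bound on the raw learning rate so that it cannot grow too large.
Introduce a new hyperparameter β₃ that controls the decay rate of the EMA, giving the bound a long-term memory of past gradient statistics.
Modify Adam's update rule by replacing the raw learning rate ηₜ with min(ηₜ, ŷₜ), where ŷₜ is the EMA of ηₜ, so that the learning rate stays bounded and stable.
Reuse Adam's existing components so that the momental bound adds very little computational overhead.
Stabilize training end to end by smoothing learning rate fluctuations, with no manual intervention or warmup schedule.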
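The update described in this list can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation (which is linked in the abstract): the function names `init_state` and `adamod_step`, the list-of-floats parameter layout, the default hyperparameter values, and the zero initialization of the EMA are all assumptions made for the example.

```python
import math


def init_state(n):
    """Create optimizer state for n scalar parameters (hypothetical helper)."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n, "s": [0.0] * n}


def adamod_step(theta, grad, state, alpha=1e-3, beta1=0.9, beta2=0.999,
                beta3=0.999, eps=1e-8):
    """Apply one AdaMod-style update and return the new parameter list."""
    state["t"] += 1
    t = state["t"]
    new_theta = []
    for i, (p, g) in enumerate(zip(theta, grad)):
        # Adam's first- and second-moment estimates with bias correction.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        # Raw per-parameter adaptive learning rate (eta_t in the text).
        eta = alpha / (math.sqrt(v_hat) + eps)
        # EMA of the adaptive rates, controlled by beta3.
        # Assumption: the EMA starts at zero.
        state["s"][i] = beta3 * state["s"][i] + (1 - beta3) * eta
        # Momental bound: the rate actually used is min(eta_t, EMA).
        eta_bounded = min(eta, state["s"][i])
        new_theta.append(p - eta_bounded * m_hat)
    return new_theta


# Hypothetical usage: a few steps on f(x) = x^2, whose gradient is 2x.
x = [5.0]
opt_state = init_state(len(x))
for _ in range(3):
    x = adamod_step(x, [2.0 * x[0]], opt_state)
```

Under the zero-initialization assumption, the EMA is tiny during the first steps, so the bound keeps the effective rate far below Adam's raw rate at the start and relaxes gradually as the average builds up.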

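Experimental Results

Research Questions

RQ1: Why do adaptive optimizers such as Adam fail to converge early in training on complex models?
RQ2: Can the instability caused by excessively large learning rates in the initial phase be resolved systematically, without relying on learning rate warmup?
RQ3: How does a momental upper bound on the adaptive learning rates affect the convergence and generalization of deep neural networks?
RQ4: To what extent does AdaMod reduce hyperparameter sensitivity, in particular the dependence on the choice of initial learning rate?
RQ5: Can AdaMod outperform Adam across diverse architectures such as Transformer and DenseNet without additional tuning?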

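Key Findings

AdaMod effectively eliminates the extremely large learning rates that appear early in training, which are the root cause of Adam's non-convergence.
On the IWSLT’14 De-En translation task, the training loss of Adam without warmup hovers around 9.5 and fails to converge, whereas AdaMod converges stably to a lower loss.
In the ResNet-34 experiments on CIFAR-10, AdaMod keeps test accuracy consistent across a wide range of initial learning rates, α ∈ {0.001, 0.01, 0.1}, showing strong robustness.
On the Transformer-small model for IWSLT’14, AdaMod (β₃ = 0.9999) outperforms both Adam and Adam with warmup, achieving the best training loss and generalization.
The method reduces or entirely removes the need for learning rate warmup across multiple tasks and models, with a clear advantage on complex architectures.
AdaMod achieves state-of-the-art performance on complex models such as DenseNet and Transformer, significantly outperforming vanilla Adam without additional hyperparameter tuning.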
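To make the first finding concrete, the sketch below applies an EMA-based bound to a synthetic sequence of adaptive rates containing one spike. The numbers are invented for illustration and are not taken from the paper's experiments; the helper name `momental_bound`, the zero-initialized EMA, and the small β₃ = 0.9 (chosen so the effect is visible over a short sequence) are assumptions.

```python
def momental_bound(rates, beta3=0.999):
    """Clip each raw rate by the running EMA of the raw rates seen so far."""
    ema = 0.0  # assumption: the EMA starts at zero
    bounded = []
    for eta in rates:
        ema = beta3 * ema + (1 - beta3) * eta
        bounded.append(min(eta, ema))
    return bounded


# Synthetic rates: a steady value of 0.01 with a single spike of 1.0.
raw = [0.01] * 50 + [1.0] + [0.01] * 50
clipped = momental_bound(raw, beta3=0.9)

# The spike at index 50 is cut to roughly a tenth of its raw value,
# because the EMA only absorbs a (1 - beta3) fraction of it.
print(raw[50], clipped[50])
```

A larger β₃ gives a longer memory: as a standard rule of thumb for exponential moving averages, the effective averaging window is about 1 / (1 − β₃) steps, so the β₃ = 0.9999 setting mentioned above corresponds to roughly 10,000 steps.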

This review was generated by AI and reviewed by a human editor.