QUICK REVIEW

[Paper Review] On the Variance of the Adaptive Learning Rate and Beyond

Liyuan Liu, Haoming Jiang|arXiv (Cornell University)|Aug 8, 2019

Advanced Neural Network Applications | References: 28 | Cited by: 607

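One-sentence summary

The paper shows that learning rate warmup helps adaptive optimizers such as Adam by reducing the variance of the adaptive learning rate early in training, and it introduces Rectified Adam (RAdam) to rectify this variance explicitly, supported by theoretical analysis and strong empirical results.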

ABSTRACT

The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Here, we study its mechanism in details. Pursuing the theory behind warmup, we identify a problem of the adaptive learning rate (i.e., it has problematically large variance in the early stage), suggest warmup works as a variance reduction technique, and provide both empirical and theoretical evidence to verify our hypothesis. We further propose RAdam, a new variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate. Extensive experimental results on image classification, language modeling, and neural machine translation verify our intuition and demonstrate the effectiveness and robustness of our proposed method. All implementations are available at: https://github.com/LiyuanLucasLiu/RAdam.

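Motivation and objectives

Identify the root cause of training instability in the early stage of training with adaptive optimizers.
Provide a theoretical justification for warmup as a variance reduction technique.
Propose a variance-rectified variant of Adam (RAdam) and analyze its properties.
Validate RAdam empirically on language modeling, image classification, and machine translation tasks.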

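Proposed method

Present a generic adaptive optimization framework parameterized by the momentum and the adaptive learning rate.
Analyze the variance of the adaptive learning rate and show that it is large when few gradient samples are available, as in the first training steps.
Introduce two variance-reducing variants (Adam-2k and Adam-eps) to support the variance hypothesis empirically.
Derive a rectification term r_t that standardizes the variance of the adaptive learning rate, based on rho_t, the effective simple moving average (SMA) length at step t.
Propose Rectified Adam (RAdam), which applies the variance rectification term when rho_t > 4 and otherwise falls back to an un-adapted momentum update.
Provide the RAdam algorithm (Algorithm 2), including practical steps and bias correction.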
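The steps above can be sketched in a few lines of Python. This is a minimal illustration for a single scalar parameter, not the authors' implementation (that lives in the repository linked in the abstract). The formulas for rho_t and r_t are written as commonly stated for Algorithm 2 of the paper; the function names, the `state` dictionary, and the small `eps` added to the denominator for numerical safety are assumptions made for this sketch.

```python
import math


def rectification_term(t, beta2=0.999):
    """Return (rho_t, r_t) for step t >= 1; r_t is None when rho_t <= 4."""
    # Maximum length of the approximated simple moving average (SMA).
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** t
    # Effective SMA length at step t.
    rho_t = rho_inf - 2.0 * t * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        # Variance is treated as intractable: no rectification term.
        return rho_t, None
    r_t = math.sqrt(
        ((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
        / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
    )
    return rho_t, r_t


def radam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One RAdam-style update of a scalar parameter (illustrative only)."""
    state["t"] += 1
    t = state["t"]
    # Exponential moving averages of the gradient and the squared gradient.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad * grad
    m_hat = state["m"] / (1.0 - beta1 ** t)  # bias-corrected momentum
    _, r_t = rectification_term(t, beta2)
    if r_t is None:
        # Early steps: un-adapted momentum update.
        return theta - lr * m_hat
    v_hat = state["v"] / (1.0 - beta2 ** t)  # bias-corrected second moment
    # eps is an assumption of this sketch, added to avoid division by zero.
    return theta - lr * r_t * m_hat / (math.sqrt(v_hat) + eps)


# Usage: a few steps on f(x) = x^2, whose gradient is 2x.
theta = 1.0
state = {"t": 0, "m": 0.0, "v": 0.0}
for _ in range(5):
    theta = radam_step(theta, 2.0 * theta, state, lr=0.1)
```

With beta2 = 0.999, rho_t stays below 4 for the first four steps, so those steps use the plain momentum branch; from the fifth step on, r_t starts near zero and approaches 1 as t grows, which plays a role similar to a warmup schedule.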

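Experimental results

Research questions

RQ1: Does the high variance of the adaptive learning rate in the early stage cause instability or convergence to bad local optima in Adam?
RQ2: Can warmup be explained theoretically as variance reduction for adaptive optimizers?
RQ3: Can we design a principled rectification method that stabilizes the adaptive learning rate without tuning additional hyperparameters?
RQ4: How does the proposed RAdam perform compared with vanilla Adam and warmup baselines on language modeling, image classification, and neural machine translation tasks?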

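Key findings

The variance of the adaptive learning rate is large in the early stage of training because few gradient samples are available, which makes updates unstable.
Warmup can be interpreted as a variance reduction technique for adaptive optimizers.
Rectified Adam (RAdam) reduces this variance in the early stage and matches or exceeds Adam across tasks, while remaining robust to changes in the learning rate.
On language modeling (One Billion Word) and image classification (CIFAR10, ImageNet), RAdam shows consistent improvements over vanilla Adam.
On neural machine translation datasets (IWSLT'14 De-En/En-De, WMT'16 En-De), RAdam reaches performance comparable to Adam with warmup without requiring extensive hyperparameter tuning.
Simulations and theoretical analysis support the variance rectification mechanism and its practical effectiveness.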
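The variance claim can be illustrated with a small simulation. The sketch below is not the paper's experiment: it assumes gradients are i.i.d. standard normal samples (a simplifying assumption in the spirit of the paper's analysis), and the function name, checkpoints, trial count, and seed are chosen only for this example. It estimates the sample variance of the bias-corrected adaptive learning rate sqrt((1 - beta2^t) / v_t) at a few steps.

```python
import math
import random
import statistics


def adaptive_lr_variance(checkpoints, trials=1000, beta2=0.999, seed=0):
    """Estimate Var[sqrt((1 - beta2^t) / v_t)] at the given steps."""
    rng = random.Random(seed)
    t_max = max(checkpoints)
    samples = {t: [] for t in checkpoints}
    for _ in range(trials):
        v = 0.0
        for t in range(1, t_max + 1):
            g = rng.gauss(0.0, 1.0)  # assumed i.i.d. N(0, 1) gradient
            v = beta2 * v + (1.0 - beta2) * g * g
            if t in samples:
                samples[t].append(math.sqrt((1.0 - beta2 ** t) / v))
    return {t: statistics.pvariance(vals) for t, vals in samples.items()}


variances = adaptive_lr_variance(checkpoints=(5, 50, 200))
for t, var in sorted(variances.items()):
    print(f"t={t:4d}  sample variance of adaptive learning rate: {var:.4f}")
```

Under these assumptions the estimated variance shrinks as t grows, which is the behavior that warmup and the rectification term are meant to compensate for in the first steps.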

This review was generated by AI and reviewed by human editors.