QUICK REVIEW

[Paper Review] AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients

Juntang Zhuang, Tommy Tang|arXiv (Cornell University)|Oct 14, 2020

Generative Adversarial Networks and Image Synthesis|Cited by 219

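One-Sentence Summary

AdaBelief adapts the stepsize of each parameter according to its "belief" in the observed gradient, aiming for fast convergence, good generalization, and training stability.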

ABSTRACT

Most popular optimizers for deep learning can be broadly categorized as adaptive methods (e.g. Adam) and accelerated schemes (e.g. stochastic gradient descent (SGD) with momentum). For many models such as convolutional neural networks (CNNs), adaptive methods typically converge faster but generalize worse compared to SGD; for complex settings such as generative adversarial networks (GANs), adaptive methods are typically the default because of their stability. We propose AdaBelief to simultaneously achieve three goals: fast convergence as in adaptive methods, good generalization as in SGD, and training stability. The intuition for AdaBelief is to adapt the stepsize according to the "belief" in the current gradient direction. Viewing the exponential moving average (EMA) of the noisy gradient as the prediction of the gradient at the next time step, if the observed gradient greatly deviates from the prediction, we distrust the current observation and take a small step; if the observed gradient is close to the prediction, we trust it and take a large step. We validate AdaBelief in extensive experiments, showing that it outperforms other methods with fast convergence and high accuracy on image classification and language modeling. Specifically, on ImageNet, AdaBelief achieves comparable accuracy to SGD. Furthermore, in the training of a GAN on Cifar10, AdaBelief demonstrates high stability and improves the quality of generated samples compared to a well-tuned Adam optimizer. Code is available at https://github.com/juntang-zhuang/Adabelief-Optimizer

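Motivation and Objectives

Combine the fast convergence of adaptive methods with good generalization and with stability in challenging settings such as GANs.
Propose AdaBelief, an optimizer derived from Adam that adapts the stepsize using the gradient prediction error.
Provide a theoretical convergence analysis for both the convex and non-convex cases.
Validate AdaBelief empirically on image classification, language modeling, and GANs, showing improved performance and stability.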

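Proposed Method

Defines AdaBelief as a modification of Adam in which the update uses m_t / sqrt(s_t) instead of m_t / sqrt(v_t).
Here m_t is the EMA of the gradient g_t, v_t is the EMA of g_t^2, and s_t is the EMA of (g_t - m_t)^2.
Applies bias correction to m_t and s_t, then updates the parameters with a projection onto a convex set, dividing the bias-corrected m_t by the square root of the bias-corrected s_t (plus epsilon).
Interprets 1/sqrt(s_t) as the "belief" in the current gradient observation: the stepsize grows when the observation agrees with the prediction and shrinks when it deviates.
Gives intuition and illustrative examples showing that AdaBelief captures curvature information and distinguishes updates by both the sign and the magnitude of the gradient.
Gives a convergence analysis for convex and non-convex stochastic optimization.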
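
The update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation (see the repository linked in the abstract for that). It assumes the unconstrained case, so the projection onto the convex set is the identity, and it assumes that epsilon is also added inside the s_t accumulation, which is how the published algorithm is usually stated. The function names `adabelief_step`, `adam_step`, and `total_displacement` and the default hyperparameters are chosen here for illustration only.

```python
import numpy as np


def adabelief_step(theta, grad, m, s, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaBelief-style update (sketch; projection omitted, i.e. unconstrained)."""
    m = beta1 * m + (1.0 - beta1) * grad                    # EMA of the gradient (the "prediction")
    # EMA of the squared prediction error; the "+ eps" here is an assumption of this sketch.
    s = beta2 * s + (1.0 - beta2) * (grad - m) ** 2 + eps
    m_hat = m / (1.0 - beta1 ** t)                          # bias correction, t starts at 1
    s_hat = s / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, for comparison: the denominator tracks g_t^2 instead."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2               # EMA of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


def total_displacement(step_fn, grads):
    """Feed a fixed gradient sequence to an optimizer step and return |theta| at the end."""
    theta, first, second = 0.0, 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        theta, first, second = step_fn(theta, g, first, second, t)
    return abs(theta)


if __name__ == "__main__":
    steady = np.ones(100)  # a large but perfectly predictable gradient
    print("AdaBelief:", total_displacement(adabelief_step, steady))
    print("Adam:     ", total_displacement(adam_step, steady))
```

With a constant gradient, the observation matches the EMA prediction more closely at every step, so s_t shrinks and the AdaBelief sketch travels further than Adam, whose step stays near lr because v_t tracks the gradient magnitude. This is the "trust it and take a large step" case from the abstract.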

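Experimental Results

Research Questions

RQ1: Can AdaBelief keep the fast convergence of adaptive methods while improving generalization to the level of SGD?
RQ2: Can AdaBelief provide training stability in GANs and other challenging settings while maintaining competitive accuracy?
RQ3: How does AdaBelief perform, theoretically and empirically, in convex and non-convex optimization?
RQ4: What practical benefits does belief-based scaling bring to real-world tasks such as image classification and language modeling?

Key Findings

On image classification tasks, AdaBelief converges as fast as Adam and generalizes similarly to SGD.
In GAN training, AdaBelief shows better training stability and sample quality than Adam, for both small and larger generators on CIFAR-10.
On ImageNet, with Adam-style default settings, AdaBelief reaches accuracy comparable to SGD, narrowing the generalization gap seen with some adaptive methods.
In language modeling, AdaBelief achieves better perplexity than competing optimizers.
On GAN benchmarks (WGAN, WGAN-GP), AdaBelief attains lower FID scores than several baselines, indicating higher image fidelity and diversity.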


This review was generated by AI and checked by human editors.