QUICK REVIEW

[Paper Review] AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods

Zhiming Zhou, Qingru Zhang|arXiv (Cornell University)|Sep 29, 2018

Stochastic Gradient Optimization Techniques | References: 12 | Cited by: 29

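One-Sentence Summary

This paper proposes AdaShift, a novel adaptive learning rate method that computes the second-moment estimate $v_t$ from the temporally shifted gradient $g_{t-n}$, which decorrelates $v_t$ from the current gradient $g_t$ and resolves Adam's non-convergence problem. The resulting step sizes are unbiased, so AdaShift converges while retaining Adam's training speed and generalization performance. The method is validated on several deep learning benchmarks, including MNIST, CIFAR-10, Tiny-ImageNet, GANs, and NMT models.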

ABSTRACT

Adam is shown not being able to converge to the optimal solution in certain cases. Researchers recently propose several algorithms to avoid the issue of non-convergence of Adam, but their efficiency turns out to be unsatisfactory in practice. In this paper, we provide new insight into the non-convergence issue of Adam as well as other adaptive learning rate methods. We argue that there exists an inappropriate correlation between gradient $g_t$ and the second-moment term $v_t$ in Adam ($t$ is the timestep), which results in that a large gradient is likely to have small step size while a small gradient may have a large step size. We demonstrate that such biased step sizes are the fundamental cause of non-convergence of Adam, and we further prove that decorrelating $v_t$ and $g_t$ will lead to unbiased step size for each gradient, thus solving the non-convergence problem of Adam. Finally, we propose AdaShift, a novel adaptive learning rate method that decorrelates $v_t$ and $g_t$ by temporal shifting, i.e., using temporally shifted gradient $g_{t-n}$ to calculate $v_t$. The experiment results demonstrate that AdaShift is able to address the non-convergence issue of Adam, while still maintaining a competitive performance with Adam in terms of both training speed and generalization.

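Motivation and Objectives

Identify the root cause of non-convergence in Adam and other adaptive learning rate methods.
Show that biased step sizes, caused by the correlation between $v_t$ and $g_t$, are the fundamental reason convergence fails.
Propose a method that decorrelates $v_t$ from $g_t$ to obtain unbiased, convergent step sizes.
Design a practical adaptive optimizer that guarantees convergence while preserving training efficiency and generalization.
Validate the proposed method across a range of deep learning tasks, including feedforward networks, CNNs, GANs, and RNNs.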
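The link between $g_t$ and the step size can be illustrated with a toy simulation. The sketch below is not from the paper: it assumes i.i.d. standard normal gradients, uses a deliberately small $\beta_2 = 0.9$ so the effect is easy to see, and the function name `step_size_correlation` is hypothetical. It measures the correlation between $|g_t|$ and the scale $1/\sqrt{v_t}$ when $v_t$ is built from $g_t$ itself (`shift=0`, as in Adam) and when it is built from a delayed gradient (`shift=1`).

```python
import numpy as np

def step_size_correlation(shift, steps=20000, beta2=0.9, seed=0):
    """Correlation between |g_t| and 1/sqrt(v_t) on synthetic gradients.

    shift=0 builds v_t from the current gradient g_t (Adam-style);
    shift>0 builds v_t from the delayed gradient g_{t-shift}.
    Assumption: gradients are i.i.d. standard normal (toy setting).
    """
    rng = np.random.default_rng(seed)
    g = rng.standard_normal(steps)
    v = 0.0
    magnitudes, scales = [], []
    for t in range(shift, steps):
        # Exponential moving average of squared (possibly delayed) gradients.
        v = beta2 * v + (1.0 - beta2) * g[t - shift] ** 2
        if t >= shift + 100:  # skip the first steps while v warms up
            magnitudes.append(abs(g[t]))
            scales.append(1.0 / np.sqrt(v))
    return float(np.corrcoef(magnitudes, scales)[0, 1])

adam_like = step_size_correlation(shift=0)
shifted = step_size_correlation(shift=1)
print(f"corr(|g_t|, 1/sqrt(v_t)) with current gradient: {adam_like:.3f}")
print(f"corr(|g_t|, 1/sqrt(v_t)) with delayed gradient: {shifted:.3f}")
```

In this toy setting the Adam-style estimate should show a clearly negative correlation (large gradients get smaller scales), while the delayed estimate should show a correlation close to zero. Real gradients are not i.i.d., so this only illustrates the mechanism, not the paper's convergence result.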

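Proposed Method

Introduce a new perspective: study convergence by analyzing the accumulated step size of each gradient (the net update factor).
Propose AdaShift, which computes $v_t$ from the temporally shifted gradient $g_{t-n}$ instead of $g_t$, decorrelating $v_t$ from the current gradient.
Define $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_{t-n}^2$, which breaks the direct correlation between $v_t$ and $g_t$.
Keep the same update rule as Adam, $\theta_{t+1} = \theta_t - \frac{\alpha_t}{\sqrt{v_t}} m_t$, but with the decorrelated $v_t$.
Apply a spatial operation (such as max pooling) to $v_t$ within each layer to improve stability and generalization, yielding max-AdaShift.
Prove theoretically that decorrelation leads to an unbiased expected step size, which ensures convergence under weaker conditions.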
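The two formulas above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `AdaShiftSketch` is hypothetical, $m_t$ is taken to be an Adam-style moving average of gradients, $v_t$ is seeded with the first delayed squared gradient, a small `eps` is added for numerical safety, and no update is applied until $g_{t-n}$ exists. The spatial operation of max-AdaShift is omitted.

```python
from collections import deque
import numpy as np

class AdaShiftSketch:
    """Illustrative optimizer: v_t is built from the delayed gradient g_{t-n}."""

    def __init__(self, lr=0.01, beta1=0.9, beta2=0.999, n=10, eps=1e-8):
        self.lr, self.beta1, self.beta2 = lr, beta1, beta2
        self.n, self.eps = n, eps
        self.recent = deque()  # holds the gradients g_{t-n}, ..., g_t
        self.m = None          # first-moment estimate m_t
        self.v = None          # second-moment estimate v_t

    def step(self, theta, grad):
        grad = np.asarray(grad, dtype=float)
        self.recent.append(grad)
        # Assumption: m_t is an Adam-style moving average of gradients.
        if self.m is None:
            self.m = np.zeros_like(grad)
        self.m = self.beta1 * self.m + (1.0 - self.beta1) * grad
        if len(self.recent) <= self.n:
            return theta  # warm-up: g_{t-n} does not exist yet
        delayed = self.recent.popleft()  # this is g_{t-n}
        if self.v is None:
            self.v = delayed ** 2  # assumption: seed v with the first delayed g^2
        else:
            # v_t = beta2 * v_{t-1} + (1 - beta2) * g_{t-n}^2
            self.v = self.beta2 * self.v + (1.0 - self.beta2) * delayed ** 2
        # theta_{t+1} = theta_t - lr / sqrt(v_t) * m_t (eps is an added safeguard)
        return theta - self.lr * self.m / (np.sqrt(self.v) + self.eps)

def minimize_quadratic(steps=2000):
    """Run the sketch on f(theta) = 0.5 * theta^2, whose gradient is theta."""
    opt = AdaShiftSketch()
    theta = np.array([1.0])
    for _ in range(steps):
        theta = opt.step(theta, theta.copy())
    return theta

print(minimize_quadratic())
```

Keeping the last $n$ gradients in a queue costs extra memory proportional to $n$ times the parameter size, which is the price of the temporal shift in this sketch.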

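Experimental Results

Research Questions

RQ1: Why does Adam, despite its widespread use, fail to converge on certain optimization problems?
RQ2: What is the root cause of non-convergent behavior in adaptive learning rate methods such as Adam?
RQ3: Can decorrelating the second-moment estimate $v_t$ from the current gradient $g_t$ yield unbiased step sizes and improve convergence?
RQ4: Can a practical adaptive optimizer be designed that guarantees convergence while retaining Adam's efficiency?
RQ5: How does the proposed method compare with existing variants (such as AMSGrad and AdamNC) in training speed, generalization, and convergence?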

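Key Findings

By decorrelating $v_t$ from $g_t$ through temporal shifting, AdaShift resolves Adam's non-convergence problem, yielding unbiased step sizes and a theoretical convergence guarantee.
On MNIST with a multilayer perceptron (MLP), AdaShift (especially non-AdaShift) generalizes better than Adam and AMSGrad, with only slight fluctuations in training loss.
With ResNet and DenseNet models on CIFAR-10, AdaShift matches or slightly exceeds Adam in test accuracy and training loss, while AMSGrad performs worse.
With a DenseNet model on Tiny-ImageNet, AdaShift reaches higher test accuracy than Adam, although the two have similar training loss curves.
In WGAN-GP training, AdaShift significantly outperforms Adam and AMSGrad in terms of discriminator performance.
On the neural machine translation (NMT) task, AdaShift achieves the highest BLEU score, ahead of Adam and AMSGrad.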


This review was generated by AI and reviewed by human editors.