QUICK REVIEW

[Paper Review] On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes

Xiaoyu Li, Francesco Orabona|arXiv (Cornell University)|May 21, 2018

Stochastic Gradient Optimization Techniques | Cited by 107

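One-Sentence Summary

This paper analyzes SGD with a generalized AdaGrad-style adaptive stepsize, proves almost sure convergence of the gradients to zero in both the convex and non-convex settings, and derives adaptive finite-time convergence rates that interpolate between those of GD and SGD while adapting to the level of gradient noise.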

ABSTRACT

Stochastic gradient descent is the method of choice for large scale optimization of machine learning objective functions. Yet, its performance is greatly variable and heavily depends on the choice of the stepsizes. This has motivated a large body of research on adaptive stepsizes. However, there is currently a gap in our theoretical understanding of these methods, especially in the non-convex setting. In this paper, we start closing this gap: we theoretically analyze in the convex and non-convex settings a generalized version of the AdaGrad stepsizes. We show sufficient conditions for these stepsizes to achieve almost sure asymptotic convergence of the gradients to zero, proving the first guarantee for generalized AdaGrad stepsizes in the non-convex setting. Moreover, we show that these stepsizes allow to automatically adapt to the level of noise of the stochastic gradients in both the convex and non-convex settings, interpolating between $O(1/T)$ and $O(1/\sqrt{T})$, up to logarithmic terms.

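Motivation and Objectives

Motivate and analyze adaptive stepsizes for SGD without relying on convexity or bounded-domain assumptions.
Prove that, in both the convex and non-convex settings, the gradients converge to zero almost surely under generalized AdaGrad stepsizes.
Show that the adaptive stepsizes automatically adapt to the gradient noise level, interpolating between the GD and SGD rates.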

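Proposed Method

Study two generalized AdaGrad-type stepsize rules: a global stepsize $\eta_t = \alpha / (\beta + \sum_{i=1}^{t-1} \|g(x_i, \xi_i)\|^2)^{1/2+\epsilon}$ and a coordinate-wise stepsize $\eta_{t,j} = \alpha / (\beta + \sum_{i=1}^{t-1} g(x_i, \xi_i)_j^2)^{1/2+\epsilon}$, where $g(x_i, \xi_i)$ denotes the stochastic gradient at iterate $x_i$.
Under Lipschitz smoothness and bounded-support noise assumptions, prove that the gradients of SGD with these stepsizes converge to zero almost surely.
In the convex setting, derive adaptive finite-time convergence rates that interpolate between the two regimes: close to GD when the noise is small and close to SGD when the noise is large.
In the non-convex setting, give convergence rates for the best iterate under the adaptive stepsizes, showing adaptation to the noise without prior knowledge of the noise level.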
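The two stepsize rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `sgd_generalized_adagrad` and `noisy_quadratic_grad`, the default constants, and the test objective $f(x) = \|x\|^2/2$ with uniform (bounded-support) noise are all hypothetical choices made for the example. Note that the stepsize at iteration t is computed from gradients of earlier iterations only, matching the sum up to t-1 in the formulas.

```python
import random
from typing import Callable, List


def sgd_generalized_adagrad(
    grad_fn: Callable[[List[float], random.Random], List[float]],
    x0: List[float],
    steps: int,
    alpha: float = 1.0,
    beta: float = 1.0,
    epsilon: float = 0.1,
    coordinate_wise: bool = False,
    seed: int = 0,
) -> List[float]:
    """Run SGD with a generalized AdaGrad stepsize (illustrative sketch)."""
    rng = random.Random(seed)
    x = list(x0)
    d = len(x)
    # Sum of squared past gradients: one scalar (global rule)
    # or one entry per coordinate (coordinate-wise rule).
    acc = [0.0] * (d if coordinate_wise else 1)
    power = 0.5 + epsilon

    for _ in range(steps):
        g = grad_fn(x, rng)  # stochastic gradient g(x_t, xi_t)

        # Stepsizes depend only on gradients from iterations 1..t-1,
        # so they are computed before the accumulator is updated.
        if coordinate_wise:
            etas = [alpha / (beta + acc[j]) ** power for j in range(d)]
        else:
            etas = [alpha / (beta + acc[0]) ** power] * d

        x = [x[j] - etas[j] * g[j] for j in range(d)]

        if coordinate_wise:
            for j in range(d):
                acc[j] += g[j] * g[j]
        else:
            acc[0] += sum(gj * gj for gj in g)
    return x


def noisy_quadratic_grad(sigma: float):
    """Gradient oracle for f(x) = ||x||^2 / 2 with uniform noise in [-sigma, sigma]."""
    def grad(x: List[float], rng: random.Random) -> List[float]:
        return [xj + rng.uniform(-sigma, sigma) for xj in x]
    return grad


if __name__ == "__main__":
    oracle = noisy_quadratic_grad(sigma=0.0)
    print(sgd_generalized_adagrad(oracle, [1.0, -2.0], steps=200, alpha=0.5))
    print(sgd_generalized_adagrad(oracle, [1.0, -2.0], steps=200, alpha=0.5,
                                  coordinate_wise=True))
```

Setting `sigma` above zero in `noisy_quadratic_grad` gives a noisy oracle with bounded support, which is the noise model named in the assumptions above.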

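Experimental Results

Research Questions

RQ1: Do generalized AdaGrad stepsizes guarantee almost sure convergence of the gradients to zero in the non-convex setting?
RQ2: Do the adaptive stepsizes adapt to the gradient noise, yielding finite-time convergence rates for convex problems that interpolate between GD and SGD?
RQ3: Do similar adaptive rates hold in the non-convex setting, specifically for the best iterate rather than the last iterate?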

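Key Findings

SGD with generalized AdaGrad stepsizes converges almost surely to zero gradient in both the non-convex and convex cases.
In convex problems, the method adapts to the noise level and interpolates between the GD and SGD rates, up to polylogarithmic terms.
In the non-convex setting, generalized AdaGrad stepsizes yield adaptive finite-time convergence rates that improve as the noise decreases, with guarantees stated for the best iterate.
The analysis provides the first theoretical support for the claim that AdaGrad-style stepsizes have an advantage over plain SGD in non-convex optimization.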
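One way to build intuition for the noise adaptation described above is to look at the stepsize itself. The sketch below is an illustration only, not a reproduction of the paper's analysis: it runs the global stepsize rule on a hypothetical 1-D objective $f(x) = x^2/2$, once with exact gradients and once with uniform noise, and records the stepsize at each iteration. The function name `stepsize_trace` and all constants are assumptions made for the example. With exact gradients the sum of squared gradients stays bounded, so the stepsize levels off at a roughly constant value as in GD; with noise the sum keeps growing, so the stepsize keeps shrinking as in SGD.

```python
import random
from typing import List


def stepsize_trace(
    sigma: float,
    steps: int = 1000,
    alpha: float = 0.5,
    beta: float = 1.0,
    epsilon: float = 0.1,
    seed: int = 0,
) -> List[float]:
    """Run 1-D SGD on f(x) = x^2 / 2 and return the stepsize used at each iteration."""
    rng = random.Random(seed)
    x = 2.0
    sq_sum = 0.0  # sum of squared gradients from earlier iterations
    etas = []
    for _ in range(steps):
        eta = alpha / (beta + sq_sum) ** (0.5 + epsilon)
        etas.append(eta)
        # Exact gradient is x; noise is uniform on [-sigma, sigma] (bounded support).
        g = x + rng.uniform(-sigma, sigma)
        x -= eta * g
        sq_sum += g * g
    return etas


if __name__ == "__main__":
    clean = stepsize_trace(sigma=0.0)
    noisy = stepsize_trace(sigma=1.0)
    print(f"final stepsize, exact gradients: {clean[-1]:.4f}")
    print(f"final stepsize, noisy gradients: {noisy[-1]:.4f}")
```

The same rule and the same constants are used in both runs; only the gradient noise differs, which is the sense in which no prior knowledge of the noise level is needed.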


This review was generated by AI and reviewed by a human editor.