QUICK REVIEW

[Paper Review] AdaGrad stepsizes: Sharp convergence over nonconvex landscapes

Rachel Ward, Xiaoxia Wu|arXiv (Cornell University)|Jun 5, 2018

Advanced Optimization Algorithms Research | Cited by 136

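One-Sentence Summary

AdaGrad-Norm converges to a stationary point in smooth nonconvex optimization, at an O(log(N)/sqrt(N)) rate in the stochastic setting and an O(1/N) rate in the deterministic setting, and it is robust to the choice of hyperparameters.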

ABSTRACT

Adaptive gradient methods such as AdaGrad and its variants update the stepsize in stochastic gradient descent on the fly according to the gradients received along the way; such methods have gained widespread use in large-scale optimization for their ability to converge robustly, without the need to fine-tune the stepsize schedule. Yet, the theoretical guarantees to date for AdaGrad are for online and convex optimization. We bridge this gap by providing theoretical guarantees for the convergence of AdaGrad for smooth, nonconvex functions. We show that the norm version of AdaGrad (AdaGrad-Norm) converges to a stationary point at the $\mathcal{O}(\log(N)/\sqrt{N})$ rate in the stochastic setting, and at the optimal $\mathcal{O}(1/N)$ rate in the batch (non-stochastic) setting -- in this sense, our convergence guarantees are 'sharp'. In particular, the convergence of AdaGrad-Norm is robust to the choice of all hyper-parameters of the algorithm, in contrast to stochastic gradient descent whose convergence depends crucially on tuning the step-size to the (generally unknown) Lipschitz smoothness constant and level of stochastic noise on the gradient. Extensive numerical experiments are provided to corroborate our theory; moreover, the experiments suggest that the robustness of AdaGrad-Norm extends to state-of-the-art models in deep learning, without sacrificing generalization.

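Research Motivation and Goals

Enable robust optimization without tuning the stepsize to the exact Lipschitz constant or noise level.
Provide theoretical convergence guarantees for AdaGrad-Norm in the smooth nonconvex setting.
Derive convergence rates for the stochastic and deterministic settings and clarify how the hyperparameters affect them.
Offer practical guidance for setting the hyperparameters when L and the noise level are unknown.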

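Proposed Method

Define the AdaGrad-Norm update: x_{j+1} = x_j - (η / b_{j+1}) G_j, where b_{j+1}^{2} = b_j^{2} + ||G_j||^{2}.
Assume G_j is an unbiased estimate of the gradient with bounded variance, and that the gradient norm is bounded: ||∇F(x)|| ≤ γ.
Prove convergence separately for the stochastic and deterministic settings (Theorems 2.1 and 2.2).
Use the descent lemma together with auxiliary bounds to handle the randomness arising from the correlation between b_j and G_j.
State the convergence rates and compare them with fixed-stepsize SGD, emphasizing robustness to the hyperparameters.
Give a practical parameter choice when F* is known: η = F(x0) − F* with a small b0.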
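
The update rule listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `adagrad_norm`, the test objective F(x) = Σ x_i² / (1 + x_i²) (chosen here only because it is smooth, nonconvex, and has a bounded gradient), and the values of `eta`, `b0`, and `num_steps` are all assumptions made for the example.

```python
import numpy as np

def adagrad_norm(grad_fn, x0, eta=1.0, b0=0.1, num_steps=1000):
    """AdaGrad-Norm sketch: one scalar stepsize eta / b_{j+1} for all coordinates.

    grad_fn(x) returns a gradient estimate G_j (exact or stochastic).
    """
    x = np.asarray(x0, dtype=float).copy()
    b_sq = b0 ** 2                       # b_0^2
    for _ in range(num_steps):
        g = grad_fn(x)                   # G_j
        b_sq += float(np.dot(g, g))      # b_{j+1}^2 = b_j^2 + ||G_j||^2
        x -= (eta / np.sqrt(b_sq)) * g   # x_{j+1} = x_j - (eta / b_{j+1}) G_j
    return x, np.sqrt(b_sq)

def grad_f(x):
    # Gradient of the hypothetical objective F(x) = sum(x_i^2 / (1 + x_i^2)).
    return 2.0 * x / (1.0 + x ** 2) ** 2

x_final, b_final = adagrad_norm(grad_f, x0=[2.0, -1.5], eta=1.0, b0=0.1, num_steps=2000)
print("final gradient norm:", np.linalg.norm(grad_f(x_final)))
print("final b:", b_final)
```

Note that b is updated before the step is taken, so the stepsize at iteration j already accounts for the current gradient G_j, as in the update written above. Passing a noisy `grad_fn` would give the stochastic variant.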

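Experimental Results

Research Questions

RQ1: With stochastic gradients, does AdaGrad-Norm converge to a stationary point of a smooth nonconvex F?
RQ2: What are the convergence rates of AdaGrad-Norm in the stochastic and deterministic settings, and how do the hyperparameters affect them?
RQ3: Without knowledge of the Lipschitz constant L or the noise level σ, is AdaGrad-Norm robust to any positive choice of η and b0?
RQ4: How do the constants in the convergence rates depend on the initial conditions and the hyperparameters?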

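Key Findings

In the stochastic setting, AdaGrad-Norm converges to an ε-approximate stationary point at an O(log(N)/sqrt(N)) rate.
In the deterministic setting, AdaGrad-Norm achieves the optimal O(1/N) rate.
Convergence holds for any η > 0 and b0 > 0, which demonstrates robustness to the choice of hyperparameters.
The constants in the convergence bounds depend explicitly on b0 and η, which gives guidance for practical parameter settings.
In contrast to fixed-stepsize SGD, AdaGrad-Norm converges robustly without prior knowledge of the smoothness constant L or the noise level σ.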
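
The robustness claim can be illustrated with a small deterministic experiment. The sketch below is not taken from the paper's experiments: the objective F(x) = Σ (x_i² + 3 sin²(x_i)) (smooth and nonconvex, with smoothness constant L = 8), the starting point, the grid of `b0` values, and the helper names are all assumptions for illustration. It runs AdaGrad-Norm and fixed-step gradient descent with the same initial stepsize η / b0 and compares the final gradient norms.

```python
import numpy as np

def grad_f(x):
    # Gradient of the hypothetical objective F(x) = sum(x_i^2 + 3 sin^2(x_i)).
    return 2.0 * x + 3.0 * np.sin(2.0 * x)

def adagrad_norm(grad_fn, x0, eta, b0, num_steps):
    x = np.asarray(x0, dtype=float).copy()
    b_sq = b0 ** 2
    for _ in range(num_steps):
        g = grad_fn(x)
        b_sq += float(np.dot(g, g))      # accumulate squared gradient norms
        x -= (eta / np.sqrt(b_sq)) * g   # adaptive scalar stepsize
    return x

def fixed_step_gd(grad_fn, x0, step, num_steps, blowup=1e6):
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(num_steps):
        g = grad_fn(x)
        if np.linalg.norm(g) > blowup:   # treat as diverged; stop before overflow
            break
        x -= step * g                    # constant stepsize, never adapted
    return x

x0 = [3.0, -2.0]
eta = 1.0
results = {}
for b0 in [0.01, 0.1, 1.0, 10.0, 100.0]:
    x_ada = adagrad_norm(grad_f, x0, eta, b0, num_steps=3000)
    x_gd = fixed_step_gd(grad_f, x0, step=eta / b0, num_steps=3000)
    results[b0] = (np.linalg.norm(grad_f(x_ada)), np.linalg.norm(grad_f(x_gd)))
    print(f"b0={b0:>6}: AdaGrad-Norm |grad|={results[b0][0]:.2e}, "
          f"fixed-step GD |grad|={results[b0][1]:.2e}")
```

On this toy problem, fixed-step gradient descent blows up once η / b0 is large relative to 1/L, while AdaGrad-Norm reaches a near-zero gradient for every b0 in the grid. This matches the qualitative message of the findings above, but it is a single hand-picked example and says nothing about the constants in the bounds.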

This review was generated by AI and reviewed by human editors.