QUICK REVIEW

[Paper Review] AdaGrad stepsizes: Sharp convergence over nonconvex landscapes

Rachel Ward, Xiaoxia Wu|arXiv (Cornell University)|Jun 5, 2018

Advanced Optimization Algorithms Research | Cited by 136

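One-sentence summary

AdaGrad-Norm converges to a stationary point in smooth nonconvex optimization, achieving an O(log(N)/sqrt(N)) rate in the stochastic setting and an O(1/N) rate in the deterministic setting, and its convergence is robust to the choice of hyperparameters.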

ABSTRACT

Adaptive gradient methods such as AdaGrad and its variants update the stepsize in stochastic gradient descent on the fly according to the gradients received along the way; such methods have gained widespread use in large-scale optimization for their ability to converge robustly, without the need to fine-tune the stepsize schedule. Yet, the theoretical guarantees to date for AdaGrad are for online and convex optimization. We bridge this gap by providing theoretical guarantees for the convergence of AdaGrad for smooth, nonconvex functions. We show that the norm version of AdaGrad (AdaGrad-Norm) converges to a stationary point at the $\mathcal{O}(\log(N)/\sqrt{N})$ rate in the stochastic setting, and at the optimal $\mathcal{O}(1/N)$ rate in the batch (non-stochastic) setting -- in this sense, our convergence guarantees are 'sharp'. In particular, the convergence of AdaGrad-Norm is robust to the choice of all hyper-parameters of the algorithm, in contrast to stochastic gradient descent whose convergence depends crucially on tuning the step-size to the (generally unknown) Lipschitz smoothness constant and level of stochastic noise on the gradient. Extensive numerical experiments are provided to corroborate our theory; moreover, the experiments suggest that the robustness of AdaGrad-Norm extends to state-of-the-art models in deep learning, without sacrificing generalization.

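Motivation and objectives

Motivate robust optimization that does not require tuning the stepsize to the exact Lipschitz constant or noise level.
Provide theoretical convergence guarantees for AdaGrad-Norm in the smooth, nonconvex setting.
Derive convergence rates for the stochastic and deterministic settings, and clarify how the hyperparameters affect them.
Provide practical guidance on hyperparameter settings when L and the noise level are unknown.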

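Proposed method

Define the AdaGrad-Norm update: x_{j+1} = x_j - (η / b_{j+1}) G_j, where b_{j+1}^{2} = b_j^{2} + ||G_j||^{2}.
Assume that G_j is an unbiased estimator of the gradient ∇F(x_j) with bounded variance, and that the gradient norm is bounded: ||∇F(x)|| ≤ γ.
Prove convergence results for the stochastic and deterministic settings (Theorems 2.1 and 2.2).
Use the descent lemma and auxiliary bounds to handle the correlated randomness between b_j and G_j.
State the convergence rates and emphasize robustness to hyperparameters in comparison with SGD using a fixed stepsize.
Provide a practical parameter choice for the case where F* is known (set η = F(x0) − F* and take b0 small).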
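
The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy objective F(x) = sum_i x_i^2 / (1 + x_i^2), the Gaussian noise model, and the names `grad_F`, `noisy_grad`, and `adagrad_norm` are all assumptions chosen so that the gradient estimate is unbiased with bounded variance and the gradient norm stays bounded.

```python
import math
import random


def grad_F(x):
    """Gradient of the toy objective F(x) = sum_i x_i^2 / (1 + x_i^2) (assumed for illustration)."""
    return [2.0 * xi / (1.0 + xi * xi) ** 2 for xi in x]


def noisy_grad(x, sigma, rng):
    """Unbiased estimate G of grad F(x): exact gradient plus zero-mean Gaussian noise."""
    return [gi + rng.gauss(0.0, sigma) for gi in grad_F(x)]


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(vi * vi for vi in v))


def adagrad_norm(x0, eta, b0, num_steps, sigma=0.0, seed=0):
    """Run AdaGrad-Norm with a single scalar stepsize eta / b_{j+1}; requires b0 > 0."""
    rng = random.Random(seed)
    x = list(x0)
    b_sq = b0 * b0
    for _ in range(num_steps):
        g = noisy_grad(x, sigma, rng)
        b_sq += sum(gi * gi for gi in g)              # b_{j+1}^2 = b_j^2 + ||G_j||^2
        step = eta / math.sqrt(b_sq)                  # eta / b_{j+1}
        x = [xi - step * gi for xi, gi in zip(x, g)]  # x_{j+1} = x_j - step * G_j
    return x, math.sqrt(b_sq)


# Deterministic run (sigma = 0) and stochastic run (sigma = 0.1) from the same start.
x_det, b_det = adagrad_norm([2.0, -1.5], eta=1.0, b0=0.1, num_steps=1000)
x_sto, b_sto = adagrad_norm([2.0, -1.5], eta=1.0, b0=0.1, num_steps=1000, sigma=0.1)
print("deterministic: ||grad F|| =", norm(grad_F(x_det)), " b =", b_det)
print("stochastic:    ||grad F|| =", norm(grad_F(x_sto)), " b =", b_sto)
```

Note that b_{j+1} is updated with the current G_j before the step is taken, so the stepsize and the gradient estimate are not independent; this is the correlation the analysis has to handle.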

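Experimental results

Research questions

RQ1: Under stochastic gradients, does AdaGrad-Norm converge to a stationary point of a smooth nonconvex F?
RQ2: What are the convergence rates of AdaGrad-Norm in the stochastic and deterministic settings, and how do the hyperparameters affect them?
RQ3: Is AdaGrad-Norm robust to any positive choice of η and b0, without knowledge of the Lipschitz constant L or the noise level σ?
RQ4: How do the constants in the convergence rates depend on the initial conditions and hyperparameters?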
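
As a reading aid for the rates asked about in RQ2, the following sketch assumes the rate is measured by the smallest squared gradient norm among the first N iterates. This is an assumption about the form of the guarantee, not a quotation of the paper's theorems; C_1 and C_2 are placeholder constants that would depend on quantities such as η, b0, L, σ, and F(x0) − F*, and stochastic guarantees of this kind are typically stated in expectation or with high probability.

```latex
\min_{0 \le j \le N-1} \|\nabla F(x_j)\|^2 \;\le\; C_1 \,\frac{\log N}{\sqrt{N}}
\quad \text{(stochastic)},
\qquad
\min_{0 \le j \le N-1} \|\nabla F(x_j)\|^2 \;\le\; \frac{C_2}{N}
\quad \text{(deterministic)}.
```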

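Key findings

In the stochastic setting, AdaGrad-Norm converges to an ε-approximate stationary point at the O(log(N)/sqrt(N)) rate.
In the deterministic setting, it achieves the optimal O(1/N) rate.
Convergence holds for any η>0 and b0>0, showing robustness to the choice of hyperparameters.
The convergence constants depend explicitly on b0 and η, which gives guidance for practical parameter settings.
Compared with fixed-stepsize SGD, AdaGrad-Norm provides robust convergence without requiring prior knowledge of the smoothness constant L or the noise level σ.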
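
The robustness finding can be illustrated with a small deterministic experiment. This is a hedged sketch, not a reproduction of the paper's experiments: the one-dimensional objective F(x) = x^2 + 3 sin^2(x) (smooth, nonconvex, with smoothness constant L = 8), the starting point, and the helper names `gd_fixed` and `adagrad_norm` are assumptions made for illustration. Gradient descent uses the constant stepsize η / b0, while AdaGrad-Norm starts from the same b0 and adapts it.

```python
import math


def grad_F(x):
    """Derivative of the toy objective F(x) = x^2 + 3 sin^2(x); F'' = 2 + 6 cos(2x), so L = 8."""
    return 2.0 * x + 3.0 * math.sin(2.0 * x)


def gd_fixed(x0, step, num_steps, blowup=1e8):
    """Gradient descent with a constant stepsize; stops early once |x| exceeds `blowup`."""
    x = x0
    for _ in range(num_steps):
        if abs(x) > blowup:
            break
        x -= step * grad_F(x)
    return x


def adagrad_norm(x0, eta, b0, num_steps):
    """Scalar AdaGrad-Norm: b^2 accumulates squared gradients, stepsize is eta / b."""
    x = x0
    b_sq = b0 * b0
    for _ in range(num_steps):
        g = grad_F(x)
        b_sq += g * g
        x -= eta / math.sqrt(b_sq) * g
    return x


eta, x0, num_steps = 1.0, 3.0, 2000
for b0 in (0.01, 0.1, 1.0, 10.0, 100.0):
    g_gd = abs(grad_F(gd_fixed(x0, eta / b0, num_steps)))
    g_ada = abs(grad_F(adagrad_norm(x0, eta, b0, num_steps)))
    print(f"b0={b0:>6}: |F'| fixed-step GD = {g_gd:.2e}, AdaGrad-Norm = {g_ada:.2e}")
```

In this toy setting, fixed-step gradient descent is only guaranteed to converge when η / b0 is below 2/L = 0.25, so small values of b0 make it blow up, whereas AdaGrad-Norm increases b until the effective stepsize is small enough. Large b0 slows both methods down rather than breaking them.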

This review was generated by AI and checked by a human editor.