QUICK REVIEW

[Paper Review] WNGrad: Learn the Learning Rate in Gradient Descent

Xiaoxia Wu, Rachel Ward|arXiv (Cornell University)|Mar 7, 2018

Stochastic Gradient Optimization Techniques | References: 21 | Citations: 55

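One-Line Summary

WNGrad introduces a dynamic learning rate update rule that adapts to gradient observations, is robust to the Lipschitz constant, and achieves near-optimal convergence in both the batch and stochastic settings.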

ABSTRACT

Adjusting the learning rate schedule in stochastic gradient methods is an important unresolved problem which requires tuning in practice. If certain parameters of the loss function such as smoothness or strong convexity constants are known, theoretical learning rate schedules can be applied. However, in practice, such parameters are not known, and the loss function of interest is not convex in any case. The recently proposed batch normalization reparametrization is widely adopted in most neural network architectures today because, among other advantages, it is robust to the choice of Lipschitz constant of the gradient in loss function, allowing one to set a large learning rate without worry. Inspired by batch normalization, we propose a general nonlinear update rule for the learning rate in batch and stochastic gradient descent so that the learning rate can be initialized at a high value, and is subsequently decreased according to gradient observations along the way. The proposed method is shown to achieve robustness to the relationship between the learning rate and the Lipschitz constant, and near-optimal convergence rates in both the batch and stochastic settings ($O(1/T)$ for smooth loss in the batch setting, and $O(1/\sqrt{T})$ for convex loss in the stochastic setting). We also show through numerical evidence that such robustness of the proposed method extends to highly nonconvex and possibly non-smooth loss function in deep learning problems. Our analysis establishes some first theoretical understanding into the observed robustness for batch normalization and weight normalization.

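Motivation and Objectives

Motivate and address the difficulty of choosing a learning rate schedule for stochastic gradient methods when key constants of the loss function are unknown.
Propose a learning rate update, inspired by reparametrization, that starts from a large initial value and adapts based on observed gradients.
Establish convergence guarantees for WNGrad in both the batch setting (smooth, nonconvex) and the stochastic setting (convex, not necessarily smooth).
Demonstrate robustness and practical performance through numerical experiments on standard datasets (MNIST and CIFAR-10).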

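Proposed Method

Introduce WNGrad through the update rules x_{k+1} = x_k - (1/b_k) ∇f(x_k) and b_{k+1} = b_k + (1/b_k) ∥∇f(x_k)∥^2.
Show that b_k grows only until it stabilizes at a bounded level tied to the unknown Lipschitz constant L, which makes the method robust to L.
Prove global convergence for smooth f: min_k ∥∇f(x_k)∥^2 ≤ ε, where the bound on the number of iterations T depends on f(x_1), f*, and L.
Prove stochastic convergence: under convexity and variance assumptions, f(x̄_k) - f* ≤ G^2(D^2+2)/(γ√k) + (b_1 ∥x_1 - x*∥^2)/(2k).
Relate WNGrad to AdaGrad: it behaves similarly but requires no square-root computation, which highlights its efficiency and scale invariance.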
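
The two update rules listed above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the quadratic objective f(x) = 0.5 * sum(a_i * x_i^2), the helper names (`grad`, `norm`, `wngrad`, `fixed_step_gd`), and the values of `curvatures`, `b1`, and `steps` are all assumptions chosen for the example. With L = 10 and b_1 = 4, the constant step 1/b_1 = 0.25 exceeds 2/L = 0.2, so plain gradient descent diverges on this objective, whereas WNGrad raises b_k after the first large gradients and then settles.

```python
def grad(x, curvatures):
    """Gradient of the toy objective f(x) = 0.5 * sum(a_i * x_i**2)."""
    return [a * xi for a, xi in zip(curvatures, x)]

def norm(v):
    """Euclidean norm of a list of floats."""
    return sum(vi * vi for vi in v) ** 0.5

def wngrad(x0, curvatures, b1, steps):
    """Apply the two WNGrad updates quoted above for a fixed number of steps."""
    x, b = list(x0), b1
    for _ in range(steps):
        g = grad(x, curvatures)                    # gradient at the current x_k
        x = [xi - gi / b for xi, gi in zip(x, g)]  # x_{k+1} = x_k - (1/b_k) * grad
        b = b + norm(g) ** 2 / b                   # b_{k+1} = b_k + ||grad||^2 / b_k
    return x, b

def fixed_step_gd(x0, curvatures, b1, steps):
    """Baseline: plain gradient descent with the constant step 1/b1."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x, curvatures)
        x = [xi - gi / b1 for xi, gi in zip(x, g)]
    return x

# Hypothetical test problem: curvatures 1 and 10, so the gradient is 10-Lipschitz.
curvatures = [1.0, 10.0]
x_wn, b_final = wngrad([1.0, 1.0], curvatures, b1=4.0, steps=2000)
x_gd = fixed_step_gd([1.0, 1.0], curvatures, b1=4.0, steps=50)

print("WNGrad   |x| =", norm(x_wn), " final b =", b_final)
print("fixed GD |x| =", norm(x_gd))
```

Both runs start from the same point and the same b_1; the only difference is whether b is updated. On this toy problem the WNGrad iterate approaches the minimizer at the origin while b stays finite, and the fixed-step iterate grows without bound.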

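Experimental Results

Research Questions

RQ1: Does a single dynamically updated learning rate parameter b_k yield convergence guarantees in both the batch and stochastic gradient settings without knowledge of the Lipschitz constant?
RQ2: Does the proposed b_k update provide robustness to the choice of learning rate scale and improve generalization on nonconvex neural network problems?
RQ3: What are the theoretical convergence rates of WNGrad in the batch (smooth) and stochastic (convex, not necessarily smooth) settings?
RQ4: How does WNGrad perform on standard real-world datasets (MNIST, CIFAR-10) compared with SGD and adaptive methods?
RQ5: How do the momentum variants (WN-Adam, WNGrad-Momentum) affect robustness to the learning rate scale?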

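Key Findings

In the batch setting with smooth f, WNGrad reaches min_{k≤T} ∥∇f(x_k)∥^2 ≤ ε within T = O((f(x_1) − f* + L)^2/ε) iterations.
In the stochastic setting, b_k grows like O(√k/G), which gives the optimal O(1/√T) rate for convex losses.
The scheme is scale invariant: multiplying f by a constant leaves the WNGrad iterates unchanged, provided b_1 is multiplied by the same constant.
WNGrad with momentum and the Adam-style variant remain robust to the scale of the Lipschitz constant and outperform standard SGD/Adam in some experimental settings.
Numerical experiments on MNIST and CIFAR-10 show that WNGrad is robust to the gradient Lipschitz constant and generalizes competitively with SGD.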
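
The scale-invariance finding can be checked numerically. The sketch below is an illustration under stated assumptions, not code from the paper: it uses a hypothetical one-dimensional objective f(x) = scale * 0.5 * curvature * x^2, and the function name `wngrad_path` and all parameter values are made up for the example. It also assumes that b_1 is multiplied by the same constant as f. Under that assumption b_k is multiplied by the constant at every step, so the step (1/b_k) ∇f(x_k) is unchanged.

```python
def wngrad_path(scale, b1, x0=1.0, curvature=10.0, steps=20):
    """WNGrad iterates on the 1-D objective f(x) = scale * 0.5 * curvature * x**2."""
    x, b, path = x0, b1, []
    for _ in range(steps):
        g = scale * curvature * x        # gradient of the scaled objective at x_k
        x, b = x - g / b, b + g * g / b  # both updates use the old b_k and the same g
        path.append(x)
    return path

c = 8.0  # hypothetical scaling constant applied to f
base = wngrad_path(scale=1.0, b1=4.0)
scaled = wngrad_path(scale=c, b1=c * 4.0)  # f and b_1 scaled together
mismatched = wngrad_path(scale=c, b1=4.0)  # f scaled, b_1 left unchanged

max_gap = max(abs(u - v) for u, v in zip(base, scaled))
print("largest gap with b_1 scaled:", max_gap)
print("first iterates with b_1 unscaled:", base[0], "vs", mismatched[0])
```

When b_1 is left unchanged, the first step already differs (the scaled gradient is divided by the same b_1), so the invariance should be read as a joint rescaling of f and b_1.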

This review was generated by AI and checked by a human editor.