QUICK REVIEW

[Paper Review] Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration

Soham De, Anirbit Mukherjee|arXiv (Cornell University)|Jul 18, 2018

Stochastic Gradient Optimization Techniques|References: 37|Citations: 82

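One-line summary

This paper provides convergence guarantees for RMSProp and ADAM in smooth non-convex optimization and compares them empirically with Nesterov acceleration on CIFAR-10 and autoencoders. It also analyzes the effect of hyperparameters, in particular the sensitivity of ADAM to its momentum parameter $β_1$.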

ABSTRACT

RMSProp and ADAM continue to be extremely popular algorithms for training neural nets but their theoretical convergence properties have remained unclear. Further, recent work has seemed to suggest that these algorithms have worse generalization properties when compared to carefully tuned stochastic gradient descent or its momentum variants. In this work, we make progress towards a deeper understanding of ADAM and RMSProp in two ways. First, we provide proofs that these adaptive gradient algorithms are guaranteed to reach criticality for smooth non-convex objectives, and we give bounds on the running time. Next we design experiments to empirically study the convergence and generalization properties of RMSProp and ADAM against Nesterov's Accelerated Gradient method on a variety of common autoencoder setups and on VGG-9 with CIFAR-10. Through these experiments we demonstrate the interesting sensitivity that ADAM has to its momentum parameter $β_1$. We show that at very high values of the momentum parameter ($β_1 = 0.99$) ADAM outperforms a carefully tuned NAG on most of our experiments, in terms of getting lower training and test losses. On the other hand, NAG can sometimes do better when ADAM's $β_1$ is set to the most commonly used value: $β_1 = 0.9$, indicating the importance of tuning the hyperparameters of ADAM to get better generalization performance. We also report experiments on different autoencoders to demonstrate that NAG has better abilities in terms of reducing the gradient norms, and it also produces iterates which exhibit an increasing trend for the minimum eigenvalue of the Hessian of the loss function at the iterates.

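Motivation and objectives

Provide the first convergence guarantees for adaptive gradient methods (RMSProp and ADAM) in non-convex optimization.
Derive running-time bounds for reaching an approximate critical point under smoothness assumptions.
Compare RMSProp and ADAM empirically with Nesterov's Accelerated Gradient (NAG) on autoencoders and CIFAR-10.
Highlight hyperparameter sensitivity, especially to ADAM's momentum parameter $β_1$, and generalization trends.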
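
For reference, the two notions these objectives rely on are usually stated as follows. This is the standard textbook formulation, given here as an assumption; the paper's exact norms and constants may differ. A differentiable $f$ is $L$-smooth if

$$\|\nabla f(x) - \nabla f(y)\| \le L \, \|x - y\| \quad \text{for all } x, y,$$

and a point $x$ is an $\epsilon$-critical (approximately critical) point if

$$\|\nabla f(x)\| \le \epsilon.$$

A running-time bound then takes the form of an iteration count $T(\epsilon)$ after which $\min_{1 \le t \le T(\epsilon)} \|\nabla f(x_t)\| \le \epsilon$ is guaranteed.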

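Proposed method

Define an L-smooth non-convex objective with finite-sum structure $f(x) = \frac{1}{k} \sum_{p=1}^{k} f_p(x)$.
Introduce and analyze the RMSProp and ADAM updates in both deterministic and stochastic settings.
Prove that stochastic RMSProp converges to an approximate critical point under a technical assumption on the gradient oracle.
Compare against Nesterov's Accelerated Gradient (NAG) through experiments on autoencoders and on VGG-9 with CIFAR-10.
Use a diagonal-preconditioning view of adaptive methods and the corresponding convergence arguments.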
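
The updates analyzed in the steps above can be sketched in a few lines of NumPy. This is a minimal illustration in the deterministic (full-gradient) setting, not the authors' code: the toy objective, the starting point, the step size `lr`, the stabilizer `xi`, and the choice to omit ADAM's bias correction are all assumptions made for brevity. Both methods divide the step coordinate-wise by a running root-mean-square of past gradients, which is the diagonal preconditioning mentioned above.

```python
import numpy as np

def grad_f(x):
    # Gradient of a hypothetical toy objective f(x) = sum(0.5*x**2 + 2*cos(x)).
    # Its second derivative 1 - 2*cos(x) lies in [-1, 3], so f is
    # non-convex and L-smooth with L = 3.
    return x - 2.0 * np.sin(x)

def rmsprop_step(x, v, g, lr=0.01, beta2=0.99, xi=1e-8):
    # v tracks an exponential moving average of squared gradients.
    v = beta2 * v + (1.0 - beta2) * g**2
    # Diagonal preconditioning: each coordinate is scaled by 1/sqrt(v).
    return x - lr * g / (np.sqrt(v) + xi), v

def adam_step(x, m, v, g, lr=0.01, beta1=0.9, beta2=0.999, xi=1e-8):
    # m is the momentum buffer controlled by beta1 (no bias correction here,
    # which is a simplifying assumption of this sketch).
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g**2
    return x - lr * m / (np.sqrt(v) + xi), m, v

def run_rmsprop(x0, steps=300):
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        x, v = rmsprop_step(x, v, grad_f(x))
    return x

def run_adam(x0, steps=300, beta1=0.9):
    x = np.array(x0, dtype=float)
    m, v = np.zeros_like(x), np.zeros_like(x)
    for _ in range(steps):
        x, m, v = adam_step(x, m, v, grad_f(x), beta1=beta1)
    return x

if __name__ == "__main__":
    x0 = [0.5, -1.0, 2.5]
    for name, x in [("RMSProp", run_rmsprop(x0)), ("ADAM", run_adam(x0))]:
        print(name, "final gradient norm:", np.linalg.norm(grad_f(x)))
```

The final gradient norm is the quantity the convergence results bound. `run_adam` exposes `beta1` so the momentum parameter can be varied, but this toy problem is not meant to reproduce the paper's experiments.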

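Experimental results

Research questions

RQ1: Do RMSProp and ADAM converge to approximate critical points in smooth non-convex optimization?
RQ2: What running-time bounds hold for these adaptive methods to reach near-stationarity?
RQ3: How do RMSProp and ADAM compare with Nesterov acceleration in terms of neural-network training and generalization?
RQ4: How does the momentum parameter $β_1$ affect ADAM's performance and generalization?
RQ5: As network size grows, do adaptive methods generalize differently from non-adaptive methods?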

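Key findings

The paper establishes the first convergence guarantees showing that adaptive gradient methods (RMSProp and ADAM) reach approximate critical points on smooth non-convex objectives.
Convergence of stochastic RMSProp is shown under an additional assumption on the gradient oracle.
Empirically, ADAM is highly sensitive to its momentum parameter $β_1$: with $β_1 = 0.99$ it can match or outperform a carefully tuned NAG on many tasks.
In the full-batch and large-network regimes, ADAM with a large $β_1$ can reach lower training and test losses on autoencoders than NAG and RMSProp.
On autoencoders, NAG is better at reducing gradient norms, and its iterates show an increasing trend in the minimum eigenvalue of the Hessian of the loss.
The empirical comparison extends beyond autoencoders to VGG-9 on CIFAR-10.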
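
The two diagnostics named in the findings, the gradient norm and the minimum Hessian eigenvalue along the iterates, can be tracked as in the sketch below. It is an illustration under stated assumptions rather than the paper's setup: the toy objective, the finite-difference Hessian helper `hessian_fd`, the starting point, and the hyperparameters `lr` and `mu` are hypothetical, and the NAG update is written in the common look-ahead form.

```python
import numpy as np

def grad_f(x):
    # Gradient of a hypothetical toy objective f(x) = sum(0.5*x**2 + 2*cos(x)),
    # which is smooth and non-convex (negative curvature near x = 0).
    return x - 2.0 * np.sin(x)

def hessian_fd(grad, x, h=1e-5):
    # Approximate the Hessian by central finite differences of the gradient,
    # then symmetrize to remove small numerical asymmetries.
    d = x.size
    H = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = h
        H[:, i] = (grad(x + e) - grad(x - e)) / (2.0 * h)
    return 0.5 * (H + H.T)

def nag_trace(x0, steps=200, lr=0.05, mu=0.9):
    # Nesterov's Accelerated Gradient in look-ahead form:
    # the gradient is evaluated at x + mu * vel rather than at x.
    x = np.array(x0, dtype=float)
    vel = np.zeros_like(x)
    trace = []
    for _ in range(steps):
        g_norm = np.linalg.norm(grad_f(x))
        # eigvalsh returns eigenvalues in ascending order; take the smallest.
        lam_min = np.linalg.eigvalsh(hessian_fd(grad_f, x))[0]
        trace.append((g_norm, lam_min))
        vel = mu * vel - lr * grad_f(x + mu * vel)
        x = x + vel
    return x, trace

if __name__ == "__main__":
    x_final, trace = nag_trace([0.5, -0.7])
    print("first (grad norm, min eigenvalue):", trace[0])
    print("last  (grad norm, min eigenvalue):", trace[-1])
```

On this toy problem the iterate starts in a region of negative curvature and ends near a local minimum. For a real network the Hessian is too large to form explicitly, so the minimum eigenvalue would have to be estimated some other way; this sketch does not cover that.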

This review was generated by AI and checked by a human editor.