QUICK REVIEW

[Paper Review] Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning

Pan Zhou, Jiashi Feng|arXiv (Cornell University)|Oct 12, 2020

Stochastic Gradient Optimization Techniques|References: 41|Citations: 57

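One-Sentence Summary

This paper models gradient noise with Lévy-driven SDEs and analyzes the escaping time from local basins to explain why SGD generalizes better than Adam, linking SGD's behavior to flatter or asymmetric minima.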

ABSTRACT

It is not clear yet why ADAM-alike adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed. This work aims to provide understandings on this generalization gap by analyzing their local convergence behaviors. Specifically, we observe the heavy tails of gradient noise in these algorithms. This motivates us to analyze these algorithms through their Levy-driven stochastic differential equations (SDEs) because of the similar convergence behaviors of an algorithm and its SDE. Then we establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both SGD and ADAM depends on the Radon measure of the basin positively and the heaviness of gradient noise negatively; (2) for the same basin, SGD enjoys smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM via adaptively scaling each gradient coordinate well diminishes the anisotropic structure in gradient noise and results in larger Radon measure of a basin; (b) the exponential gradient average in ADAM smooths its gradient and leads to lighter gradient noise tails than SGD. So SGD is more locally unstable than ADAM at sharp minima defined as the minima whose local basins have small Radon measure, and can better escape from them to flatter ones with larger Radon measure. As flat minima here which often refer to the minima at flat or asymmetric basins/valleys often generalize better than sharp ones, our result explains the better generalization performance of SGD over ADAM. Finally, experimental results confirm our heavy-tailed gradient noise assumption and theoretical affirmation.

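Motivation and Objectives

Motivate and characterize the generalization gap between SGD and Adam in deep learning.
Characterize gradient noise as heavy-tailed and justify a Lévy-driven SDE framework.
Analyze the escaping time from local basins to understand which basins each algorithm selects.
Relate escaping behavior to the Radon measure of a basin and to generalization performance.
Provide empirical validation of the heavy-tailed noise assumption and the theoretical claims.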

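Proposed Method

Formulate SGD and Adam as Lévy-driven stochastic differential equations (SDEs).
Assume the gradient noise follows a heavy-tailed symmetric α-stable (SαS) distribution with time-dependent covariance Σt. (Eq. (4) for SGD, Eq. (5) for Adam.)
Define the escaping time Γ from a local basin Ω and the escaping set W, whose size is measured by the Radon measure m(W).
Derive that Γ is of order O(ε^{-α}/m(W)), so it depends on the noise tail index α and on the basin geometry.
Show that Adam's geometry adaptation reduces m(W) and that its exponential gradient averaging lightens the noise tails, both of which lengthen the escaping time.
Support the theory with the empirical observation of heavy-tailed gradient noise (Figure 1).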
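
The escaping-time picture above can be explored numerically. The following is a minimal sketch, not the paper's experiment. It assumes a one-dimensional quadratic loss F(θ) = θ²/2, a basin Ω = (-half_width, half_width), isotropic noise (Σt = 1), and an Euler discretization of a Lévy-driven SDE with step size eta. The names `sample_sas` and `mean_escape_steps`, the parameter values, and the choice of the Chambers-Mallows-Stuck sampler are illustrative assumptions that do not come from the paper.

```python
import numpy as np


def sample_sas(alpha, size, rng):
    """Draw standard symmetric alpha-stable samples (Chambers-Mallows-Stuck).

    alpha = 2 gives a Gaussian with variance 2, alpha = 1 gives a Cauchy.
    """
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))


def mean_escape_steps(alpha, half_width, eps=0.1, eta=0.01,
                      n_runs=200, max_steps=5000, seed=0):
    """Mean number of steps until |theta| leaves (-half_width, half_width).

    Runs that never escape are counted as max_steps, so the returned
    value is a lower bound on the true mean escaping time.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_runs)             # every run starts at the minimum
    steps = np.full(n_runs, max_steps)   # default: did not escape
    alive = np.ones(n_runs, dtype=bool)  # runs still inside the basin
    for t in range(1, max_steps + 1):
        noise = sample_sas(alpha, n_runs, rng)
        # Euler step: gradient of theta^2 / 2 is theta; a Levy increment
        # over time eta scales as eta ** (1 / alpha).
        proposal = theta - eta * theta + eps * eta ** (1.0 / alpha) * noise
        theta = np.where(alive, proposal, theta)
        escaped = alive & (np.abs(theta) > half_width)
        steps[escaped] = t
        alive &= ~escaped
        if not alive.any():
            break
    return steps.mean()


heavy = mean_escape_steps(alpha=1.2, half_width=1.0)   # heavier tails
light = mean_escape_steps(alpha=1.9, half_width=1.0)   # lighter tails
narrow = mean_escape_steps(alpha=1.2, half_width=0.5)  # smaller basin
print(f"alpha=1.2, width 1.0: {heavy:.0f} steps")
print(f"alpha=1.9, width 1.0: {light:.0f} steps")
print(f"alpha=1.2, width 0.5: {narrow:.0f} steps")
```

In this toy setting, heavier tails (smaller α) and a narrower basin both shorten the escape, which matches the qualitative dependence of Γ on α and on the basin described above. It says nothing about Adam, whose SDE additionally involves coordinate-wise scaling and gradient averaging.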

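Experimental Results

Research Questions

RQ1: Why do Adam-like adaptive gradient methods generalize worse than SGD despite training faster?
RQ2: How does the tail behavior of gradient noise affect the escaping time from a local basin?
RQ3: How do basin geometry and the Radon measure affect the probability of escaping from sharp local minima?
RQ4: Why does SGD tend to converge to flat or asymmetric basins with larger Radon measure, and why does that lead to better generalization?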

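Key Findings

Gradient noise in SGD and Adam is heavy-tailed and is better described by a Lévy (SαS) process than by Gaussian noise.
The escaping time from a basin grows as ε^{-α} in the tail index α and is inversely proportional to the Radon measure of the escaping set: Γ = O(ε^{-α}/m(W)).
Adam's adaptive scaling reduces the measure m(W) of the escaping set, and its exponential gradient averaging lightens the noise tails; both increase Γ and keep Adam in the basin it is already in, including sharp ones.
SGD is more locally unstable at sharp minima and tends to escape to flat or asymmetric basins with larger Radon measure, which leads to better generalization.
Empirical results support the heavy-tailed noise assumption and the theoretical predictions about escaping dynamics.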
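
To make the scaling Γ = O(ε^{-α}/m(W)) concrete, here is a small arithmetic sketch. It evaluates only the order term ε^{-α}/m(W) and drops all constants, so the ratios are meaningful but the absolute values are not. The function name `escape_time_scale` and the numbers ε = 0.1, α = 1.2 or 1.6, and m(W) = 1.0 or 0.5 are hypothetical values chosen for illustration and are not taken from the paper.

```python
def escape_time_scale(eps, alpha, m_w):
    """Order term eps**(-alpha) / m(W) of the escaping-time bound."""
    return eps ** (-alpha) / m_w


base = escape_time_scale(eps=0.1, alpha=1.2, m_w=1.0)
half_measure = escape_time_scale(eps=0.1, alpha=1.2, m_w=0.5)
lighter_tail = escape_time_scale(eps=0.1, alpha=1.6, m_w=1.0)

# Halving the escaping-set measure doubles the term.
print(half_measure / base)            # 2.0
# Lighter tails (alpha 1.2 -> 1.6) multiply it by 10**0.4.
print(round(lighter_tail / base, 2))  # 2.51
```

Both changes point in the direction the findings attribute to Adam: a smaller m(W) and lighter tails each enlarge the term, and their effects multiply.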


This review was generated by AI and checked by a human editor.