QUICK REVIEW

[Paper Review] The Heavy-Tail Phenomenon in SGD

Mert Gürbüzbalaban, Umut Şimşekli | arXiv (Cornell University) | Jun 8, 2020

Stochastic Gradient Optimization Techniques | References: 56 | Citations: 38

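One-line summary

This paper proves that, in the quadratic setting, the iterates of stochastic gradient descent (SGD) can converge to a heavy-tailed stationary distribution whose tail heaviness is determined by the stepsize, batch size, dimension, and curvature, and it shows that this theory is consistent with experiments on neural networks.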

ABSTRACT

In recent years, various notions of capacity and complexity have been proposed for characterizing the generalization properties of stochastic gradient descent (SGD) in deep learning. Some of the popular notions that correlate well with the performance on unseen data are (i) the 'flatness' of the local minimum found by SGD, which is related to the eigenvalues of the Hessian, (ii) the ratio of the stepsize $η$ to the batch-size $b$, which essentially controls the magnitude of the stochastic gradient noise, and (iii) the 'tail-index', which measures the heaviness of the tails of the network weights at convergence. In this paper, we argue that these three seemingly unrelated perspectives for generalization are deeply linked to each other. We claim that depending on the structure of the Hessian of the loss at the minimum, and the choices of the algorithm parameters $η$ and $b$, the SGD iterates will converge to a heavy-tailed stationary distribution. We rigorously prove this claim in the setting of quadratic optimization: we show that even in a simple linear regression problem with independent and identically distributed data whose distribution has finite moments of all order, the iterates can be heavy-tailed with infinite variance. We further characterize the behavior of the tails with respect to algorithm parameters, the dimension, and the curvature. We then translate our results into insights about the behavior of SGD in deep learning. We support our theory with experiments conducted on synthetic data, fully connected, and convolutional neural networks.

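Motivation and objectives

Motivate how notions of capacity and complexity relate to the generalization of SGD in deep learning.
Show that the SGD iterates can converge to a heavy-tailed stationary distribution under certain algorithmic and problem settings.
Characterize how the tail heaviness depends on the stepsize, batch size, dimension, and curvature.
Provide rigorous results in the linear/quadratic setting and connect them to observations in deep learning.
Support the theory with experiments on synthetic data and neural networks.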

Proposed method

Model SGD as an iterated random recursion x_k = Psi_Omega_k(x_{k-1}).
Approximate SGD near a quadratic minimum by an affine recursion x_k ≈ (I - (eta/b) H_k) x_{k-1} + q_k.
Apply implicit renewal theory and stochastic matrix recursions to derive the tail-index alpha via h(alpha)=1.
Show that, under Gaussian input, the tail-index alpha increases with the batch size b and decreases with the stepsize eta and the input variance.
Establish three stepsize regimes: finite variance (alpha > 2), heavy tails with infinite variance (alpha < 2), and possible divergence (rho >= 0).
Provide non-asymptotic moment bounds and Wasserstein distance convergence results.
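
The condition h(alpha) = 1 can be made concrete in the simplest case. The following is a minimal sketch, not the authors' code, assuming dimension d = 1 and inputs a_i ~ N(0, sigma^2). Under that assumption H_k is sigma^2 times a chi-square variable with b degrees of freedom, the multiplier M_k = 1 - (eta/b) H_k is an i.i.d. scalar, and both rho = E log|M| and h(s) = E|M|^s can be estimated by Monte Carlo. The function name `tail_index_1d` and its parameters are hypothetical, and the returned values are noisy estimates.

```python
import numpy as np


def tail_index_1d(eta, b, sigma=1.0, n_samples=200_000, seed=0, max_s=64.0):
    """Monte Carlo estimate of (rho, alpha) for the scalar recursion.

    Assumes d = 1 and a_i ~ N(0, sigma^2), so H_k = sum_i a_i^2 is
    sigma^2 times a chi-square variable with b degrees of freedom.
    """
    rng = np.random.default_rng(seed)
    H = sigma**2 * rng.chisquare(b, size=n_samples)
    abs_M = np.abs(1.0 - (eta / b) * H)      # |M_k| = |1 - (eta/b) H_k|

    rho = float(np.mean(np.log(abs_M)))      # scalar case: rho = E log|M|
    if rho >= 0.0:
        return rho, None                     # outside the regime rho < 0

    def h(s):
        return float(np.mean(abs_M ** s))    # scalar case: h(s) = E|M|^s

    # h is convex with h(0) = 1 and h'(0) = rho < 0, so h < 1 on (0, alpha).
    hi = 1.0
    while h(hi) < 1.0:
        hi *= 2.0
        if hi > max_s:
            return rho, float("inf")         # no root found below max_s
    lo = 0.0
    for _ in range(60):                      # bisection on h(s) = 1
        mid = 0.5 * (lo + hi)
        if h(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return rho, 0.5 * (lo + hi)


if __name__ == "__main__":
    for eta, b in [(0.5, 1), (1.0, 1), (1.0, 4)]:
        rho, alpha = tail_index_1d(eta, b)
        print(f"eta={eta}, b={b}: rho={rho:.3f}, alpha={alpha}")
```

In this scalar setting h(2) = 1 - 2 eta sigma^2 + eta^2 sigma^4 (1 + 2/b), so the estimated alpha should cross 2 near eta = 2b / (sigma^2 (b + 2)). Comparing the printed values for different eta and b is a quick way to see the dependence on stepsize and batch size described in the list above.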

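Experimental results

Research questions

RQ1: Under a standard quadratic/linear regression setting, does SGD converge to a heavy-tailed stationary distribution?
RQ2: How do the stepsize, batch size, dimension, and curvature affect the tail index of the SGD stationary distribution?
RQ3: For Gaussian and non-Gaussian data models, can the tail heaviness be tied explicitly to the algorithm parameters?
RQ4: How do heavy tails affect convergence speed and generalization in deep learning?
RQ5: Do the neural network experiments support the theoretical heavy-tail behavior?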

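Key findings

Even with light-tailed data, the SGD iterates in quadratic/linear regression can have a heavy-tailed stationary distribution with infinite variance.
There is a unique positive alpha satisfying h(alpha) = 1, and it determines the tail decay: u^T x_infty has a polynomial tail of index alpha.
For Gaussian inputs, the tails become heavier as the curvature and the ratio eta/b increase, and lighter as the batch size b increases.
When rho < 0, the distribution of the iterates converges to the stationary distribution at an exponential rate in Wasserstein distance.
Depending on eta and b, three regimes are identified: finite variance (alpha > 2), heavy tails with infinite variance (alpha < 2), and potential divergence (rho >= 0).
When alpha <= 1, the moments of x_k of order p < alpha remain finite; when alpha > 1, moments of order p in (1, alpha) are controlled by explicit bounds.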
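
The first finding can be checked numerically. The sketch below is an illustration, not the paper's experimental setup: it runs many independent SGD chains on streaming least squares with Gaussian (light-tailed) data and compares the tail of the final iterates for a small and a large stepsize. The data model (x_star = 0, unit-variance noise), the function names, and the use of the Hill estimator are all assumptions made here. The Hill estimator is a standard tail-index estimator chosen for brevity and is not necessarily the one used in the paper.

```python
import numpy as np


def simulate_sgd_least_squares(eta, b, d=1, sigma=1.0, n_chains=20_000,
                               n_steps=300, seed=0):
    """Run independent SGD chains on streaming least squares.

    Assumed data model (illustrative): a_i ~ N(0, sigma^2 I_d) and
    y_i = a_i^T x_star + eps_i with x_star = 0 and eps_i ~ N(0, 1).
    Returns the norms ||x_K|| of the final iterates, one per chain.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros((n_chains, d))
    for _ in range(n_steps):
        A = sigma * rng.standard_normal((n_chains, b, d))  # fresh mini-batch
        y = rng.standard_normal((n_chains, b))             # y_i = eps_i (x_star = 0)
        residual = np.einsum("nbd,nd->nb", A, x) - y
        grad = np.einsum("nbd,nb->nd", A, residual) / b
        x = x - eta * grad   # x_k = (I - (eta/b) H_k) x_{k-1} + q_k
    return np.linalg.norm(x, axis=1)


def hill_estimator(values, k):
    """Hill estimate of the tail index from the k largest values."""
    ordered = np.sort(values)
    threshold = ordered[-k - 1]              # (k+1)-th largest value
    return k / float(np.sum(np.log(ordered[-k:] / threshold)))


if __name__ == "__main__":
    for eta in (0.05, 1.0):
        norms = simulate_sgd_least_squares(eta=eta, b=1)
        print(f"eta={eta}: Hill tail-index estimate "
              f"{hill_estimator(norms, k=1000):.2f}")
```

A smaller Hill estimate indicates a heavier tail. The estimate depends on the choice of k and on the chains having run long enough to be near stationarity, so it should be read as a qualitative comparison between stepsizes rather than a precise value of alpha.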

This review was generated by AI and checked by a human editor.