QUICK REVIEW

[Paper Review] On the Origin of Implicit Regularization in Stochastic Gradient Descent

Samuel Smith, Benoît Dherin|arXiv (Cornell University)|Jan 28, 2021

Stochastic Gradient Optimization Techniques | References: 31 | Citations: 40

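One-Sentence Summary

This paper shows that SGD with a small but finite learning rate behaves like gradient flow on a modified loss that includes an implicit regularizer, and it derives this modified loss through a backward error analysis that accounts for the minibatch structure.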

ABSTRACT

For infinitesimal learning rates, stochastic gradient descent (SGD) follows the path of gradient flow on the full batch loss function. However moderately large learning rates can achieve higher test accuracies, and this generalization benefit is not explained by convergence bounds, since the learning rate which maximizes test accuracy is often larger than the learning rate which minimizes training loss. To interpret this phenomenon we prove that for SGD with random shuffling, the mean SGD iterate also stays close to the path of gradient flow if the learning rate is small and finite, but on a modified loss. This modified loss is composed of the original loss function and an implicit regularizer, which penalizes the norms of the minibatch gradients. Under mild assumptions, when the batch size is small the scale of the implicit regularization term is proportional to the ratio of the learning rate to the batch size. We verify empirically that explicitly including the implicit regularizer in the loss can enhance the test accuracy when the learning rate is small.

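Motivation and Objectives

Explain the generalization benefit of SGD with finite learning rates, which convergence bounds do not account for.
Derive a modified loss for SGD that includes an implicit regularizer penalizing the norms of the minibatch gradients.
Explain the difference between SGD and GD in terms of the implicit regularizer.
Verify empirically that including the implicit regularizer in the loss can improve test accuracy.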

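Proposed Method

Use a backward error analysis adapted to the minibatch structure to derive a modified loss for the mean SGD iterate after one epoch.
Show that the modified loss for SGD is C(ω) + (ε/4m) ∑_{k=0}^{m-1} ||∇Ĉ_k(ω)||^2, where Ĉ_k is the loss on the k-th minibatch, ε is the learning rate, and m is the number of minibatches per epoch.
Work out the relationship between the modified losses of GD and SGD, and compare the effects of the gradients and the batch size.
Compute the expected SGD update after one epoch to identify the bias term that arises from the minibatch ordering.
Show the linear scaling rule between learning rate and batch size within the modified-loss framework.
Provide empirical evidence that explicitly including the implicit regularizer improves test accuracy.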
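
The modified loss above can be evaluated directly for a toy problem. The following is a minimal sketch, assuming a one-dimensional least-squares model with made-up data and a fixed partition into equal-sized minibatches; the function names are hypothetical and none of this is the paper's experimental setup.

```python
# Illustrative sketch only: a 1-D least-squares model, not the paper's setup.
# Per-example loss (an assumption for this example): C_i(w) = 0.5 * (w * x_i - y_i) ** 2


def example_grad(w, x, y):
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x


def full_loss(w, data):
    """Full-batch loss C(w): mean of the per-example losses."""
    return sum(0.5 * (w * x - y) ** 2 for x, y in data) / len(data)


def minibatch_grad(w, batch):
    """Gradient of the minibatch loss, i.e. the mean gradient over the batch."""
    return sum(example_grad(w, x, y) for x, y in batch) / len(batch)


def sgd_modified_loss(w, data, batch_size, lr):
    """C(w) + (lr / (4 m)) * sum_k ||grad C_hat_k(w)||^2 for one fixed partition.

    Assumes batch_size divides len(data), so all m minibatches have equal size.
    """
    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    m = len(batches)
    penalty = sum(minibatch_grad(w, b) ** 2 for b in batches)
    return full_loss(w, data) + lr / (4 * m) * penalty


data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 8.0)]  # made-up (x, y) pairs
w, lr = 1.0, 0.1

print("C(w)                    :", full_loss(w, data))
print("modified loss, B = 2    :", sgd_modified_loss(w, data, batch_size=2, lr=lr))
print("modified loss, B = N = 4:", sgd_modified_loss(w, data, batch_size=4, lr=lr))
```

With a single minibatch (B = N, m = 1) the penalty reduces to (ε/4)||∇C(ω)||^2, the full-batch case. Splitting the same data into smaller minibatches can only increase the penalty, because the mean of the squared minibatch gradients is at least the square of their mean.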

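Experimental Results

Research Questions

RQ1: Does SGD with a finite learning rate follow the path of gradient flow on a modified loss?
RQ2: What form does the implicit regularizer induced by the minibatch structure of SGD take?
RQ3: How does the implicit regularizer scale with the learning rate and the batch size?
RQ4: Can including the implicit regularizer in the training loss improve generalization?
RQ5: How do the modified losses of SGD and GD differ in their minima and trajectories?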

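Key Findings

The mean SGD iterate after one epoch stays close to the path of gradient flow on the modified loss.
The modified SGD loss is given explicitly as C(ω) + (ε/4m) ∑_{k=0}^{m-1} ||∇Ĉ_k(ω)||^2.
The implicit regularizer penalizes the mean squared norm of the minibatch gradients, with coefficient ε/4 on the mean (equivalently, ε/(4m) on the sum over the m minibatches).
When the minibatch gradients are diverse, the implicit regularizer scales as ε/B, where B is the batch size, which explains the effect of batch size.
Explicitly optimizing the modified loss can improve test accuracy at small learning rates.
The experiments show that including the implicit regularizer in the loss improves test performance.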
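
The ε/B scaling can be checked numerically with a standard sampling identity. If each minibatch is a uniformly random subset of B out of N examples, the expected squared norm of the minibatch gradient is ||∇C||^2 + ((N − B)/(N − 1)) · Γ/B, where Γ is the mean squared deviation of the per-example gradients from their mean (the symbol Γ is introduced here for illustration). Multiplying by ε/4 puts a factor of ε/(4B) on the variance term, which is proportional to ε/B when B is much smaller than N. The sketch below assumes made-up scalar per-example gradients and enumerates every possible minibatch exactly; the function names are hypothetical.

```python
from itertools import combinations


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def expected_sq_minibatch_grad(grads, batch_size):
    """Exact average of (minibatch mean gradient)^2 over all size-B subsets."""
    return mean(mean(batch) ** 2 for batch in combinations(grads, batch_size))


def predicted_sq_minibatch_grad(grads, batch_size):
    """||grad C||^2 + (N - B) / (N - 1) * Gamma / B (finite-population identity)."""
    n = len(grads)
    g_bar = mean(grads)                             # full-batch gradient
    gamma = mean((g - g_bar) ** 2 for g in grads)   # per-example gradient variance
    return g_bar ** 2 + (n - batch_size) / (n - 1) * gamma / batch_size


grads = [-1.0, -2.0, -12.0, -16.0, 3.0, 5.0]  # made-up scalar per-example gradients

for B in (1, 2, 3, 6):
    print(B, expected_sq_minibatch_grad(grads, B), predicted_sq_minibatch_grad(grads, B))
```

The two columns agree for every batch size, and the gap above ||∇C||^2 shrinks roughly like 1/B until it vanishes at B = N.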


This review was generated by AI and checked by a human editor.