QUICK REVIEW

[Paper Review] Why gradient clipping accelerates training: A theoretical justification for adaptivity

Jingzhao Zhang, Tianxing He|arXiv (Cornell University)|May 28, 2019

Stochastic Gradient Optimization Techniques | References: 53 | Citations: 85

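One-line summary

The paper introduces a relaxed smoothness condition, (L0,L1)-smoothness, under which the bound on the Hessian is allowed to grow with the gradient norm. It proves that under this condition clipped gradient descent and normalized gradient descent converge faster than gradient descent with a fixed step size, and it supports the theory with NLP and CV experiments.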

ABSTRACT

We provide a theoretical explanation for the effectiveness of gradient clipping in training deep neural networks. The key ingredient is a new smoothness condition derived from practical neural network training examples. We observe that gradient smoothness, a concept central to the analysis of first-order optimization algorithms that is often assumed to be a constant, demonstrates significant variability along the training trajectory of deep neural networks. Further, this smoothness positively correlates with the gradient norm, and contrary to standard assumptions in the literature, it can grow with the norm of the gradient. These empirical observations limit the applicability of existing theoretical analyses of algorithms that rely on a fixed bound on smoothness. These observations motivate us to introduce a novel relaxation of gradient smoothness that is weaker than the commonly used Lipschitz smoothness assumption. Under the new condition, we prove that two popular methods, namely, gradient clipping and normalized gradient, converge arbitrarily faster than gradient descent with fixed stepsize. We further explain why such adaptively scaled gradient methods can accelerate empirical convergence and verify our results empirically in popular neural network training settings.

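Motivation and objectives

Explain why adaptively scaled gradient methods work well in neural network training, particularly on NLP and CV tasks.
Introduce a relaxed smoothness condition, (L0,L1)-smoothness, that allows local smoothness to grow with the gradient norm and is weaker than the standard Lipschitz-gradient assumption (L-smoothness).
Establish convergence guarantees for clipped gradient descent and normalized gradient descent under the new smoothness condition.
Provide empirical evidence from language modeling and image classification that validates the theory and illustrates the acceleration mechanism.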
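
A one-dimensional worked example may make the gap between the two conditions concrete. The function f(x) = x^4 is an illustrative choice made here, not one taken from the review, and the relaxed condition is assumed to take the form |f''(x)| ≤ L0 + L1 |f'(x)| in one dimension.

```latex
\begin{align*}
f(x) &= x^4, \qquad f'(x) = 4x^3, \qquad f''(x) = 12x^2, \\
\sup_x |f''(x)| &= \infty \quad \text{(no global Lipschitz constant $L$ for $f'$)}, \\
12t^2 - 4t^3 &\le 16 \quad \text{for all } t = |x| \ge 0 \quad \text{(maximum attained at } t = 2\text{)}, \\
\Rightarrow\ |f''(x)| &\le 16 + 1 \cdot |f'(x)| \quad \text{for all } x, \text{ i.e. } (L_0, L_1) = (16, 1).
\end{align*}
```

Under these assumptions the curvature bound grows with the gradient, which a single constant L cannot capture on an unbounded domain.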

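Proposed method

Define a new (L0,L1)-smoothness condition: ||∇²f(x)|| ≤ L0 + L1||∇f(x)||.
Analyze gradient-based methods under this relaxed smoothness, including clipped GD and normalized GD (NGD).
Prove convergence rates and bounds: Theorem 3 for convergence of clipped GD to a stationary point; Theorems 4 and 6 comparing fixed-step GD with clipped GD.
Extend the analysis to the stochastic setting using Assumptions 1–5, deriving Theorems 7 and 8, which compare stochastic clipped GD with SGD.
Relate clipped GD to NGD, showing that their step sizes are equivalent up to a constant factor.
Support the theory with intuition from neural network training and the empirical observation that gradient norm correlates with local smoothness.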
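
The two update rules and their relationship can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: it assumes the clipped step size is min(eta, gamma * eta / ||g||) and the normalized step size is eta / (||g|| + beta), and the function names, including `matched_step_sizes`, are hypothetical. With the assumed matching eta_c = eta_n / beta and gamma = beta, the clipped step size equals eta_n / max(beta, ||g||), so the normalized step size stays within a factor of two of it.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def clipped_gd_step(x, g, eta, gamma):
    """One clipped GD step with step size min(eta, gamma * eta / ||g||) (assumed form)."""
    n = norm(g)
    h = eta if n == 0.0 else min(eta, gamma * eta / n)  # guard against a zero gradient
    return [xi - h * gi for xi, gi in zip(x, g)]


def normalized_gd_step(x, g, eta, beta):
    """One normalized GD step with step size eta / (||g|| + beta) (assumed form)."""
    h = eta / (norm(g) + beta)
    return [xi - h * gi for xi, gi in zip(x, g)]


def matched_step_sizes(grad_norm, eta_n, beta):
    """Return (h_c, h_n) for the matched choice eta_c = eta_n / beta, gamma = beta."""
    h_n = eta_n / (grad_norm + beta)
    eta_c, gamma = eta_n / beta, beta
    h_c = eta_c if grad_norm == 0.0 else min(eta_c, gamma * eta_c / grad_norm)
    return h_c, h_n


if __name__ == "__main__":
    for grad_norm in [0.0, 0.5, 1.0, 10.0, 1e4]:
        h_c, h_n = matched_step_sizes(grad_norm, eta_n=0.1, beta=1.0)
        print(f"||g||={grad_norm:g}  h_c={h_c:.3e}  h_n={h_n:.3e}")
```

Since max(beta, ||g||) ≤ ||g|| + beta ≤ 2 max(beta, ||g||), the printed pairs satisfy h_c / 2 ≤ h_n ≤ h_c, which is one concrete way to read "equivalent up to a constant".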

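Experimental results

Research questions

RQ1: Does the proposed (L0,L1)-smoothness condition describe neural network loss landscapes adequately, compared with standard Lipschitz smoothness?
RQ2: Under the new smoothness assumption, can clipped gradient descent and normalized gradient descent converge faster than fixed-step gradient descent?
RQ3: Under the relaxed smoothness, does stochastic clipped GD converge faster than fixed-step SGD?
RQ4: Do the empirical training dynamics of NLP and CV tasks show the predicted correlation between gradient norm and local smoothness?
RQ5: Do the language modeling and image classification experiments support the theoretical speedup?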

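Key findings

Under the (L0,L1)-smoothness condition, clipped GD converges arbitrarily faster than fixed-step GD.
Under the relaxed smoothness, stochastic clipped GD converges faster than fixed-step SGD.
NLP training experiments reveal a strong correlation between gradient norm and local smoothness, supporting the theory.
Experiments with LSTM language modeling and ResNet20 on CIFAR-10 show that clipping accelerates training and can match or exceed baseline performance.
Gradient clipping allows the training trajectory to pass through non-smooth regions, which explains the faster convergence seen in practice.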
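
The speedup mechanism can be illustrated on a toy problem. The sketch below is an assumption-laden illustration, not one of the paper's experiments: the objective f(x) = x^4, the start point 10, and all step-size values are hypothetical choices. From this start, a fixed step of 0.1 would overshoot (10 - 0.1 * 4000 = -390), so fixed-step GD needs a much smaller step, which then slows it down near the minimum where curvature is low. Clipping only limits the step where the gradient is large, so it can keep the larger base step.

```python
def grad(x):
    # Gradient of the toy objective f(x) = x**4 (hypothetical example).
    return 4.0 * x**3


def run(x0, step_fn, tol=1e-3, max_iters=100_000):
    """Iterate x <- x - h * grad(x) until |grad(x)| <= tol; return the iteration count."""
    x = x0
    for k in range(max_iters):
        g = grad(x)
        if abs(g) <= tol:
            return k
        x -= step_fn(g) * g
    return max_iters


def fixed_step(eta):
    # Constant step size, independent of the gradient.
    return lambda g: eta


def clipped_step(eta, gamma):
    # Clipped step size min(eta, gamma * eta / |g|); g is nonzero when this is called.
    return lambda g: min(eta, gamma * eta / abs(g))


# Fixed step chosen small enough not to diverge from x0 = 10.
iters_fixed = run(10.0, fixed_step(0.004))
# Clipped GD keeps the larger base step eta = 0.1.
iters_clipped = run(10.0, clipped_step(0.1, 1.0))

if __name__ == "__main__":
    print("fixed-step GD iterations:", iters_fixed)
    print("clipped GD iterations:  ", iters_clipped)
```

In this setup clipped GD reaches the tolerance in far fewer iterations than fixed-step GD. The size of the gap depends on the start point and step sizes chosen here.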

This review was generated by AI and checked by a human editor.