QUICK REVIEW

[Paper Review] How Does Learning Rate Decay Help Modern Neural Networks?

Kaichao You, Mingsheng Long|arXiv (Cornell University)|Aug 5, 2019

Neural Networks and Applications|References: 42|Cited by: 161

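One-Line Summary

This paper argues that learning rate decay helps modern neural networks because decaying the learning rate makes it possible to learn complex patterns, while the initially large learning rate helps prevent the network from memorizing noisy data. It validates this pattern-complexity view through controlled experiments and a transferability analysis on real-world datasets.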

ABSTRACT

Learning rate decay (lrDecay) is a de facto technique for training modern neural networks. It starts with a large learning rate and then decays it multiple times. It is empirically observed to help both optimization and generalization. Common beliefs in how lrDecay works come from the optimization analysis of (Stochastic) Gradient Descent: 1) an initially large learning rate accelerates training or helps the network escape spurious local minima; 2) decaying the learning rate helps the network converge to a local minimum and avoid oscillation. Despite the popularity of these common beliefs, experiments suggest that they are insufficient in explaining the general effectiveness of lrDecay in training modern neural networks that are deep, wide, and nonconvex. We provide another novel explanation: an initially large learning rate suppresses the network from memorizing noisy data while decaying the learning rate improves the learning of complex patterns. The proposed explanation is validated on a carefully-constructed dataset with tractable pattern complexity. And its implication, that additional patterns learned in later stages of lrDecay are more complex and thus less transferable, is justified in real-world datasets. We believe that this alternative explanation will shed light into the design of better training strategies for modern neural networks.
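
The abstract describes lrDecay as starting with a large learning rate and decaying it multiple times. A minimal sketch of such a step schedule is shown here. The function name `step_decay_lr` and the default values for `base_lr`, `milestones`, and `gamma` are illustrative assumptions, not settings reported in the paper.

```python
from bisect import bisect_right


def step_decay_lr(epoch, base_lr=0.1, milestones=(60, 120, 160), gamma=0.2):
    """Return the learning rate for `epoch` under a step-decay schedule.

    All default values are hypothetical and chosen only for illustration.
    """
    # Count how many milestones have been reached at this epoch.
    num_decays = bisect_right(sorted(milestones), epoch)
    # Multiply the base rate by gamma once per milestone reached.
    return base_lr * (gamma ** num_decays)


if __name__ == "__main__":
    for epoch in (0, 59, 60, 120, 160):
        print(epoch, step_decay_lr(epoch))
```

The rate stays constant between milestones and drops by a factor of `gamma` at each one, so training begins with the large rate and ends with the smallest.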

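Motivation and Objectives

Challenges the common explanations for why lrDecay works in deep networks.
Proposes a pattern-complexity view of the effect of lrDecay.
Validates the proposed view with controlled experiments on a tractable dataset.
Examines the implications for the transferability of learned patterns across datasets.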
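
The conventional optimization view that the paper challenges (a decayed learning rate helps convergence and avoids oscillation) can be illustrated on a toy problem. The sketch below runs plain gradient descent on the one-dimensional quadratic f(w) = 0.5 * curvature * w^2. The function `gradient_descent`, the two schedules, and all numeric values are assumptions made for this illustration and do not come from the paper.

```python
def gradient_descent(lr_schedule, curvature=1.0, w0=1.0, steps=40):
    """Run GD on f(w) = 0.5 * curvature * w**2 and return all iterates."""
    w = w0
    trajectory = [w]
    for t in range(steps):
        grad = curvature * w              # derivative of the quadratic
        w = w - lr_schedule(t) * grad     # plain gradient descent update
        trajectory.append(w)
    return trajectory


def constant_large(t):
    # lr * curvature == 2, so each update maps w to -w: pure oscillation.
    return 2.0


def decayed(t):
    # Same large rate for 20 steps, then a 10x smaller rate.
    return 2.0 if t < 20 else 0.2


oscillating = gradient_descent(constant_large)
converging = gradient_descent(decayed)

if __name__ == "__main__":
    print("constant large lr, final |w|:", abs(oscillating[-1]))
    print("decayed lr, final |w|:", abs(converging[-1]))
```

With the constant large rate the iterate only flips sign and never shrinks. After the decay each step multiplies w by 0.8, so it contracts toward the minimum. This reproduces the textbook intuition only. The paper's argument is that this intuition alone is insufficient for deep, wide, nonconvex networks.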

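Proposed Method

Critically compares the GD/SGD explanations of lrDecay against empirical results for a WideResNet trained on CIFAR-10.
Constructs the Pattern Separation 10 (PS10) dataset, which separates simple patterns from complex patterns.
Defines pattern complexity as the expected class-conditional entropy and measures how simple and complex patterns are learned under lrDecay.
Uses transfer learning experiments to evaluate how well patterns learned in later stages transfer to target datasets.
Analyzes Hessian eigenvalues to support claims about the relationship between training dynamics and decay.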
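
To make the complexity measure above concrete, here is a minimal sketch that estimates an expected class-conditional entropy, sum over y of P(y) * H(x | y), from empirical counts. It assumes each input can be reduced to a discrete, hashable pattern identifier, which is a simplification made for this example. The function name and the toy data are hypothetical and are not the PS10 construction.

```python
import math
from collections import Counter, defaultdict


def expected_class_conditional_entropy(samples):
    """Estimate sum_y P(y) * H(x | y) in nats from (pattern, label) pairs.

    `samples` is an iterable of (x, y), where x is a hashable pattern id.
    """
    by_class = defaultdict(Counter)
    for x, y in samples:
        by_class[y][x] += 1

    total = sum(sum(counts.values()) for counts in by_class.values())
    complexity = 0.0
    for counts in by_class.values():
        n_y = sum(counts.values())
        # Entropy of the empirical pattern distribution within this class.
        entropy = -sum((c / n_y) * math.log(c / n_y) for c in counts.values())
        # Weight by the empirical class probability.
        complexity += (n_y / total) * entropy
    return complexity


# Toy "simple" data: every class always shows one pattern.
simple = [("a", 0)] * 4 + [("b", 1)] * 4
# Toy "complex" data: every class is spread uniformly over four patterns.
complex_patterns = [(f"p{y}{k}", y) for y in (0, 1) for k in range(4)]

if __name__ == "__main__":
    print(expected_class_conditional_entropy(simple))
    print(expected_class_conditional_entropy(complex_patterns))
```

Under this toy definition, one fixed pattern per class gives zero complexity, while spreading each class uniformly over four patterns gives log(4) nats.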

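Experimental Results

Research Questions

RQ1: Do the GD/SGD explanations fully account for the benefits of lrDecay in modern networks?
RQ2: Does lrDecay mainly help the learning of complex patterns, or does the large initial learning rate suppress the memorization of noisy data?
RQ3: Can the observed training and transfer phenomena be explained within the pattern-complexity framework?
RQ4: How does lrDecay affect the transferability of patterns learned at different training stages?
RQ5: Do real-world datasets show that patterns learned in later lrDecay stages are less transferable?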

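Key Findings

lrDecay improves the learning of complex patterns, beyond aiding convergence or escaping local minima.
An initially large learning rate helps the network avoid memorizing noisy data and thus acts as a regularizer.
The tractable PS10 dataset shows that simple patterns are learned first and that decay improves the learning of complex patterns.
On real-world datasets, patterns learned in later lrDecay stages transfer less well across tasks.
A transferability analysis between ImageNet and target datasets shows that patterns learned later are less transferable, supporting the complexity-based view.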


This review was generated by AI and checked by a human editor.