QUICK REVIEW

[Paper Review] A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation

Akhilesh Deepak Gotmare, Nitish Shirish Keskar|arXiv (Cornell University)|Oct 29, 2018

Neural Networks and Applications | Cited by 107

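One-Sentence Summary

This paper empirically analyzes cosine learning rate restarts, learning rate warmup, and knowledge distillation using mode connectivity and SVCCA, with the aim of understanding the training dynamics and learned representations of deep networks.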

ABSTRACT

The convergence rate and final performance of common deep learning models have significantly benefited from heuristics such as learning rate schedules, knowledge distillation, skip connections, and normalization layers. In the absence of theoretical underpinnings, controlled experiments aimed at explaining these strategies can aid our understanding of deep learning landscapes and the training dynamics. Existing approaches for empirical analysis rely on tools of linear interpolation and visualizations with dimensionality reduction, each with their limitations. Instead, we revisit such analysis of heuristics through the lens of recently proposed methods for loss surface and representation analysis, viz., mode connectivity and canonical correlation analysis (CCA), and hypothesize reasons for the success of the heuristics. In particular, we explore knowledge distillation and learning rate heuristics of (cosine) restarts and warmup using mode connectivity and CCA. Our empirical analysis suggests that: (a) the reasons often quoted for the success of cosine annealing are not evidenced in practice; (b) that the effect of learning rate warmup is to prevent the deeper layers from creating training instability; and (c) that the latent knowledge shared by the teacher is primarily disbursed to the deeper layers.

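Motivation and Objectives

Motivate an understanding of widely used deep learning heuristics that goes beyond their empirical success.
Investigate cosine annealing / SGDR, learning rate warmup, and knowledge distillation using modern analysis tools.
Assess how these heuristics affect the loss surface and the representations across network layers.
Provide insight into where and how these heuristics take effect during training.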

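Proposed Method

Connect the optima found under different training schemes via mode connectivity, and analyze the resulting curves and barriers.
Use SVCCA (with SVD/DFT preprocessing for efficiency) to measure the representational similarity of activations across networks and across training iterations.
Characterize SGDR through its learning rate schedule and compare it against standard SGD with and without restarts.
Use CCA to study how layer-wise activations evolve in the warmup and distillation scenarios.
Run controlled experiments with VGG-16 and ResNet-family networks on CIFAR-10 and observe the dynamics across layers.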
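
For readers unfamiliar with the schedules under study, the following is a minimal sketch of an SGDR-style cosine schedule with warm restarts and of a linear warmup schedule. The function names, default values (`eta_max`, `eta_min`, `cycle_len`, `cycle_mult`, `target_lr`, `warmup_steps`) and the step granularity are illustrative assumptions; they are not the settings used in the paper.

```python
import math


def sgdr_lr(step, eta_max=0.1, eta_min=0.0, cycle_len=10, cycle_mult=1):
    """Cosine-annealed learning rate with warm restarts (SGDR-style).

    All default values are illustrative assumptions, not the paper's settings.
    """
    # Locate the current cycle and the position inside it.
    t_cur, t_i = step, cycle_len
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= cycle_mult
    # Cosine decay from eta_max down towards eta_min within the cycle.
    cos_term = 1.0 + math.cos(math.pi * t_cur / t_i)
    return eta_min + 0.5 * (eta_max - eta_min) * cos_term


def warmup_lr(step, target_lr=0.1, warmup_steps=5):
    """Linear warmup: ramp up to target_lr, then hold it constant."""
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr


if __name__ == "__main__":
    # The cosine schedule jumps back to eta_max at every restart.
    print([round(sgdr_lr(s), 4) for s in range(12)])
    print([round(warmup_lr(s), 4) for s in range(7)])
```

With `cycle_mult=1` every cycle has the same length; a value of 2 doubles the cycle length after each restart. In a real training loop the returned value would be assigned to the optimizer's learning rate at each step or epoch.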

Experimental Results

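Research Questions

RQ1: Do the restarts in cosine annealing / SGDR create or cross barriers in the loss landscape, and is this essential to their success?
RQ2: How does learning rate warmup affect stability, and which network layers are affected most?
RQ3: Where in the student network's representations does the knowledge transferred by distillation appear?
RQ4: What do mode connectivity and SVCCA reveal about the training dynamics under these heuristics?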
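
As background for RQ3, the sketch below shows a commonly used knowledge distillation objective in the style of Hinton et al. (2015): a KL term between temperature-softened teacher and student distributions, combined with ordinary cross-entropy on the true label. This is an assumption about the general form of such losses, not necessarily the exact objective used in the paper; `temperature` and `alpha` are hypothetical values chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target KL term and a hard-label cross-entropy term.

    temperature and alpha are illustrative assumptions, not the paper's values.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions.
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    # Standard cross-entropy against the ground-truth class index.
    ce = -math.log(softmax(student_logits)[label])
    # The T^2 factor keeps the soft-target term on a scale comparable to ce.
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * ce


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    student = [1.5, 1.2, 0.3]
    print(distillation_loss(student, teacher, label=0))
```

When the student's logits match the teacher's, the KL term vanishes and only the hard-label term remains.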

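Key Findings

The benefit of cosine annealing is not consistently evidenced as a matter of avoiding barriers; iterates do cross barriers after a restart, but this alone may not fully explain the benefit.
Learning rate warmup mainly limits weight changes in the deeper layers, and freezing those layers yields comparable stability.
In distillation, the teacher's latent knowledge is distributed primarily to the student's deeper (discriminative) layers.
Representation similarity analysis shows that, after training, activations in shallow layers are more similar, while deeper layers have more differentiated representations.
Mode connectivity reveals high-accuracy connecting curves between diverse optima, suggesting that the loss landscape is connected across training choices.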
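
To make the notion of a connecting curve and a barrier concrete, here is a toy sketch. It assumes the quadratic Bezier parametrization that is common in the mode connectivity literature (Garipov et al., 2018), phi(t) = (1-t)^2 w1 + 2t(1-t) theta + t^2 w2, and measures the barrier as the highest loss along the path minus the higher endpoint loss. The 2-D `toy_loss`, the hand-picked bend point, and all function names are hypothetical; in practice theta is trained to minimize the loss along the curve and the weights are full network parameter vectors.

```python
def bezier_point(w1, w2, theta, t):
    """Point on the quadratic Bezier curve from w1 (t=0) to w2 (t=1) with bend theta."""
    a, b, c = (1.0 - t) ** 2, 2.0 * t * (1.0 - t), t ** 2
    return [a * x + b * m + c * y for x, m, y in zip(w1, theta, w2)]


def barrier_along_path(loss_fn, w1, w2, theta, num_points=21):
    """Highest loss sampled on the path minus the higher of the two endpoint losses."""
    losses = [loss_fn(bezier_point(w1, w2, theta, i / (num_points - 1)))
              for i in range(num_points)]
    return max(losses) - max(losses[0], losses[-1])


def toy_loss(w):
    """Hypothetical 2-D loss whose minima lie on the unit circle."""
    return (w[0] ** 2 + w[1] ** 2 - 1.0) ** 2


if __name__ == "__main__":
    w1, w2 = [-1.0, 0.0], [1.0, 0.0]
    midpoint = [0.0, 0.0]  # theta at the midpoint reduces the curve to a straight line
    bend = [0.0, 2.0]      # hand-picked bend; in practice theta is optimized
    print("straight-line barrier:", barrier_along_path(toy_loss, w1, w2, midpoint))
    print("curved-path barrier:  ", barrier_along_path(toy_loss, w1, w2, bend))
```

In this toy setting the straight line between the two minima passes through a high-loss region, while the bent curve stays close to the low-loss circle, which is the qualitative picture the finding above describes.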


This review was generated by AI and checked by a human editor.