QUICK REVIEW

[Paper Review] Train longer, generalize better: closing the generalization gap in large batch training of neural networks

Elad Hoffer, Itay Hubara|arXiv (Cornell University)|May 24, 2017

Domain Adaptation and Few-Shot Learning | References: 38 | Citations: 418

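One-line summary

This paper argues that the generalization gap in large-batch SGD stems from too few weight updates rather than from the batch size itself, and shows how learning-rate scaling, Ghost Batch Normalization, and regime adaptation narrow the gap.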

ABSTRACT

Background: Deep learning models are typically trained using stochastic gradient descent or one of its variants. These methods update the weights using their gradient, estimated from a small fraction of the training data. It has been observed that when using large batch sizes there is a persistent degradation in generalization performance - known as the "generalization gap" phenomena. Identifying the origin of this gap and closing it had remained an open problem. Contributions: We examine the initial high learning rate training phase. We find that the weight distance from its initialization grows logarithmically with the number of weight updates. We therefore propose a "random walk on random landscape" statistical model which is known to exhibit similar "ultra-slow" diffusion behavior. Following this hypothesis we conducted experiments to show empirically that the "generalization gap" stems from the relatively small number of updates rather than the batch size, and can be completely eliminated by adapting the training regime used. We further investigate different techniques to train models in the large-batch regime and present a novel algorithm named "Ghost Batch Normalization" which enables significant decrease in the generalization gap without increasing the number of updates. To validate our findings we conduct several additional experiments on MNIST, CIFAR-10, CIFAR-100 and ImageNet. Finally, we reassess common practices and beliefs concerning training of deep models and suggest they may not be optimal to achieve good generalization.

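Motivation and objectives

Motivate and characterize the generalization gap observed in large-batch training of neural networks.
Propose a stochastic optimization model (random walk on a random potential) to explain the weight dynamics in the early phase of training.
Develop practical techniques for closing the gap: learning-rate scaling, Ghost Batch Normalization (GBN), and regime adaptation.
Validate the findings empirically on MNIST, CIFAR-10/100, and ImageNet across multiple architectures.
Reassess common training practices, emphasizing that generalization depends on the number of updates and not only on the batch size.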

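Proposed method

Model SGD as a random walk on a random potential to explain the ultra-slow diffusion of the weights.
Show that the distance of the weights from their initialization grows logarithmically with the number of updates (roughly log t), and relate the diffusion rate to the batch size.
Scale the learning rate with the batch size M (η ∝ sqrt(M)) to preserve the update statistics.
Introduce Ghost Batch Normalization, which computes BN statistics on small ghost batches inside a large batch.
Advocate regime adaptation: extend the number of training iterations so that the number of updates is comparable across batch sizes.
Validate empirically on standard datasets and networks, reporting accuracy improvements for large-batch (LB) training relative to the small-batch (SB) baseline.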
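
The Ghost Batch Normalization step listed above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the ghost batch size divides the large batch evenly, normalizes per feature, and leaves out the learnable scale and shift as well as the running statistics used at inference time. The function name `ghost_batch_norm` and the toy input are hypothetical.

```python
import math

def ghost_batch_norm(batch, ghost_size, eps=1e-5):
    """Normalize each ghost batch with its own per-feature mean and variance.

    batch: list of samples, each a list of floats of equal length.
    ghost_size: size of each small "ghost" batch; assumed to divide len(batch).
    """
    if len(batch) % ghost_size != 0:
        raise ValueError("ghost_size must divide the batch size")
    out = []
    for start in range(0, len(batch), ghost_size):
        ghost = batch[start:start + ghost_size]
        n_features = len(ghost[0])
        # Statistics come from this ghost batch only, not the full large batch.
        means = [sum(row[j] for row in ghost) / ghost_size
                 for j in range(n_features)]
        variances = [sum((row[j] - means[j]) ** 2 for row in ghost) / ghost_size
                     for j in range(n_features)]
        for row in ghost:
            out.append([(row[j] - means[j]) / math.sqrt(variances[j] + eps)
                        for j in range(n_features)])
    return out

# Toy large batch of 4 one-feature samples, split into 2 ghost batches of 2.
large_batch = [[1.0], [3.0], [10.0], [30.0]]
normalized = ghost_batch_norm(large_batch, ghost_size=2)
# Setting ghost_size to the full batch size recovers plain batch normalization.
full_bn = ghost_batch_norm(large_batch, ghost_size=4)
```

Because every ghost batch is normalized against its own statistics, the noise in the normalization resembles that of small-batch training even though the gradient is still computed over the whole large batch.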

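Experimental results

Research questions

RQ1: Can the generalization gap observed in large-batch training be closed without increasing total training time?
RQ2: What mechanism explains how weight updates early in training affect final generalization, and how do batch size and the number of updates interact?
RQ3: Do adjustments such as learning-rate scaling and Ghost Batch Normalization consistently narrow or close the generalization gap across architectures and datasets?
RQ4: Can extending the large-batch training regime match the generalization performance of small-batch training?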

Key findings

Network	Dataset	SB	LB	+LR	+GBN	+RA
F1 (Keskar et al., 2017)	MNIST	98.27%	97.05%	97.55%	97.60%	98.53%
C1 (Keskar et al., 2017)	CIFAR-10	87.80%	83.95%	86.15%	86.40%	88.20%
Resnet44 (He et al., 2016)	CIFAR-10	92.83%	86.10%	89.30%	90.50%	93.07%
VGG (Simonyan, 2014)	CIFAR-10	92.30%	84.10%	88.60%	91.50%	93.03%
C3 (Keskar et al., 2017)	CIFAR-100	61.25%	51.50%	57.38%	57.50%	63.20%
WResnet16-4 (Zagoruyko, 2016)	CIFAR-100	73.70%	68.15%	69.05%	71.20%	73.57%
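
To make the table easier to read, the short script below computes, for each row, how much of the gap between small-batch (SB) and large-batch (LB) accuracy each technique recovers. The numbers are copied from the table above. The helper `gap_recovered` and the abbreviated network labels are introduced here for illustration only and do not appear in the paper.

```python
# (network, dataset, SB, LB, +LR, +GBN, +RA), accuracy in percent, from the table.
rows = [
    ("F1",          "MNIST",     98.27, 97.05, 97.55, 97.60, 98.53),
    ("C1",          "CIFAR-10",  87.80, 83.95, 86.15, 86.40, 88.20),
    ("Resnet44",    "CIFAR-10",  92.83, 86.10, 89.30, 90.50, 93.07),
    ("VGG",         "CIFAR-10",  92.30, 84.10, 88.60, 91.50, 93.03),
    ("C3",          "CIFAR-100", 61.25, 51.50, 57.38, 57.50, 63.20),
    ("WResnet16-4", "CIFAR-100", 73.70, 68.15, 69.05, 71.20, 73.57),
]

def gap_recovered(sb, lb, acc):
    """Fraction of the SB-LB gap closed by a method (1.0 means it matches SB)."""
    return (acc - lb) / (sb - lb)

for name, dataset, sb, lb, lr, gbn, ra in rows:
    print(f"{name:12s} {dataset:10s} gap={sb - lb:5.2f}  "
          f"+LR={gap_recovered(sb, lb, lr):.2f}  "
          f"+GBN={gap_recovered(sb, lb, gbn):.2f}  "
          f"+RA={gap_recovered(sb, lb, ra):.2f}")

# Count the rows where regime adaptation reaches or exceeds the SB accuracy.
ra_matches_sb = sum(1 for r in rows if r[6] >= r[2])
```

In these rows, +GBN is at least as accurate as +LR everywhere, and +RA reaches or exceeds the SB accuracy in five of the six rows; the exception is WResnet16-4, which ends 0.13 points below SB.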

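The generalization gap in large-batch training can be substantially reduced by learning-rate scaling and Ghost Batch Normalization.
The distance of the weights from their initialization grows logarithmically with the number of updates, and does so consistently across batch sizes, indicating that the diffusion dynamics, rather than the batch size itself, drive generalization.
Scaling the learning rate with the square root of the batch size helps preserve the update statistics and improves generalization.
Ghost Batch Normalization, which trains with a large batch but computes the Batch Normalization statistics on small ghost batches, significantly reduces the generalization error.
Regime adaptation, which adjusts the number of weight updates to match the iteration count of small-batch training, closes the gap and yields validation accuracy equal to or better than the small-batch baseline.
Experiments on MNIST, CIFAR-10/100, and ImageNet show consistent gains from +LR, +GBN, and +RA, often matching or exceeding the SB results.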
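
The learning-rate scaling and regime adaptation described above can be worked through with a small sketch. It assumes the learning rate is multiplied by the square root of the batch-size ratio and the epoch count by the ratio itself, so that the number of weight updates stays the same. The function names, the dataset size, and the hyperparameter values are hypothetical and chosen only to give round numbers.

```python
import math

def adapt_regime(base_lr, base_epochs, small_batch, large_batch):
    """Translate a small-batch regime into a large-batch one.

    The learning rate grows with sqrt(large_batch / small_batch); the epoch
    count grows linearly with the ratio so the update count is preserved.
    """
    ratio = large_batch / small_batch
    scaled_lr = base_lr * math.sqrt(ratio)
    adapted_epochs = int(round(base_epochs * ratio))
    return scaled_lr, adapted_epochs

def num_updates(dataset_size, batch_size, epochs):
    """Total weight updates, assuming the batch size divides the dataset size."""
    return (dataset_size // batch_size) * epochs

# Hypothetical setup: 51,200 samples, small batch 128, large batch 2048.
lr, epochs = adapt_regime(base_lr=0.1, base_epochs=100,
                          small_batch=128, large_batch=2048)
updates_small = num_updates(51_200, 128, 100)
updates_large = num_updates(51_200, 2048, epochs)
print(lr, epochs, updates_small, updates_large)
```

With a batch-size ratio of 16, the learning rate is multiplied by 4 and the epoch count by 16, and both regimes perform 40,000 updates. The cost is that the large-batch run processes 16 times as many samples in total, which is the trade-off regime adaptation accepts in exchange for closing the gap.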

This review was generated by AI and checked by a human editor.