QUICK REVIEW

[Paper Review] Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train

Valeriu Codreanu, Damian Podareanu|arXiv (Cornell University)|Nov 12, 2017

Advanced Neural Network Applications|References: 23|Citations: 36

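One-line summary

This paper proposes a scalable, highly efficient framework for training ResNet-50 on ImageNet-1K by running large-minibatch SGD on up to 104,000 x86 cores. It reports a 28-minute training time at above 90% scaling efficiency. It also introduces the novel Collapsed Ensemble technique, which raises top-1 accuracy to 77.5% without changing the model architecture.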

ABSTRACT

For the past 5 years, the ILSVRC competition and the ImageNet dataset have attracted a lot of interest from the Computer Vision community, allowing for state-of-the-art accuracy to grow tremendously. This should be credited to the use of deep artificial neural network designs. As these became more complex, the storage, bandwidth, and compute requirements increased. This means that with a non-distributed approach, even when using the most high-density server available, the training process may take weeks, making it prohibitive. Furthermore, as datasets grow, the representation learning potential of deep networks grows as well by using more complex models. This synchronicity triggers a sharp increase in the computational requirements and motivates us to explore the scaling behaviour on petaflop scale supercomputers. In this paper we will describe the challenges and novel solutions needed in order to train ResNet-50 in this large scale environment. We demonstrate above 90% scaling efficiency and a training time of 28 minutes using up to 104K x86 cores. This is supported by software tools from Intel's ecosystem. Moreover, we show that with regular 90 - 120 epoch train runs we can achieve a top-1 accuracy as high as 77% for the unmodified ResNet-50 topology. We also introduce the novel Collapsed Ensemble (CE) technique that allows us to obtain a 77.5% top-1 accuracy, similar to that of a ResNet-152, while training an unmodified ResNet-50 topology for the same fixed training budget. All ResNet-50 models as well as the scripts needed to replicate them will be posted shortly.

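Motivation and objectives

Reduce the training time of deep residual networks on large datasets such as ImageNet-1K without sacrificing accuracy.
Address the generalization gap and the convergence problems associated with large-minibatch SGD in deep learning.
Enable high-performance, scalable training on petaflop-scale HPC systems by leveraging Intel's software stack and the x86 architecture.
Develop techniques that improve model accuracy within a fixed training budget, particularly for large-batch training.
Show that the best accuracy can be reached through an optimized training strategy with minimal architectural changes.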

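Proposed method

Apply very large global batch sizes, up to 65,536, using data parallelism across thousands of CPU cores.
Apply a modified batch normalization adapted to large local and global batch sizes to stabilize training.
Adopt an aggressive, linearly scaled learning-rate schedule with gradual warm-up to preserve convergence.
Introduce the Collapsed Ensemble (CE) technique, which forms an ensemble by reusing snapshots taken from a single training run, to improve generalization.
Adopt weight-decay and learning-rate strategies inspired by cyclical and SGDR schedules to improve optimization stability.
Leverage Intel's distribution of Caffe and an HPC-optimized software stack to scale efficiently on Intel Knights Landing and Skylake systems.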
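
The linearly scaled learning rate with gradual warm-up mentioned in the list above can be sketched as follows. This is a minimal illustration, not the authors' schedule: the base rate of 0.1 per 256 samples, the 5-epoch warm-up, and the function name `learning_rate` are all assumptions chosen for the example, and the later decay phase is omitted.

```python
def learning_rate(step, steps_per_epoch, global_batch,
                  base_lr=0.1, base_batch=256, warmup_epochs=5):
    """Linear-scaling rule with gradual warm-up (illustrative values only)."""
    # Linear scaling: the target rate grows in proportion to the global batch.
    target_lr = base_lr * global_batch / base_batch
    warmup_steps = warmup_epochs * steps_per_epoch
    if step < warmup_steps:
        # Ramp linearly from base_lr up to target_lr during warm-up.
        return base_lr + (target_lr - base_lr) * step / warmup_steps
    # After warm-up this sketch holds the rate constant; a real schedule
    # would decay it from here.
    return target_lr


# Hypothetical setup: global batch of 8,192 and 100 steps per epoch.
for step in (0, 250, 500):
    print(step, learning_rate(step, steps_per_epoch=100, global_batch=8192))
```

Starting at the small-batch rate and ramping up is meant to avoid the instability that a large rate can cause in the first epochs, when the weights are still changing rapidly.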

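Experimental results

Research questions

RQ1: Can large-batch SGD training reach high accuracy on ImageNet-1K without the usual generalization gap?
RQ2: Can ResNet-50 training time be cut to under 30 minutes while maintaining or improving top-1 accuracy?
RQ3: What training techniques are needed to achieve above 90% scaling efficiency on more than 100,000 x86 cores?
RQ4: Can a single ResNet-50 architecture with a fixed training budget reach 77.5% single-model accuracy?
RQ5: To what extent does the Collapsed Ensemble technique outperform standard ensemble and snapshot methods on ImageNet-1K?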
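
RQ3 turns on scaling efficiency, which this review does not define. One common definition, assumed here, is the measured speedup divided by the ideal linear speedup relative to a baseline run. The sketch below uses that definition; the function name and all numbers are hypothetical and are not taken from the paper.

```python
def scaling_efficiency(base_nodes, base_throughput, nodes, throughput):
    """Measured speedup divided by ideal (linear) speedup."""
    measured_speedup = throughput / base_throughput
    ideal_speedup = nodes / base_nodes
    return measured_speedup / ideal_speedup


# Hypothetical throughputs in images per second, not figures from the paper.
eff = scaling_efficiency(base_nodes=1, base_throughput=100.0,
                         nodes=512, throughput=47000.0)
print(f"{eff:.1%}")  # prints 91.8%
```

Under this definition, "above 90%" means that the run on many nodes delivers at least nine tenths of the throughput that perfect linear scaling from the baseline would give.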

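Key findings

Using the Collapsed Ensemble technique, the authors reached 77.5% top-1 accuracy with a single ResNet-50 model, comparable to a ResNet-152.
They achieved a training time of 28 minutes on up to 104,000 x86 cores, with scaling efficiency above 90%.
With the proposed learning-rate schedule and training techniques, training converged to 76.5% top-1 accuracy in 75 epochs.
With a 5-model ensemble on ImageNet-1K, the Collapsed Ensemble method outperformed the snapshot ensemble method of Huang et al.
The framework showed strong scaling efficiency on both Intel Knights Landing and Skylake architectures; on MareNostrum 4, the training time to 76.5% accuracy was projected to be under 50 minutes.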
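For reproducibility, all models and training scripts are to be released through the IntelCaffe GitHub repository; the abstract states that they will be posted shortly.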
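
The ensemble findings above rest on combining predictions from several snapshots of one training run. The sketch below shows generic snapshot averaging only: it averages per-class probabilities and takes the arg-max. It is not the Collapsed Ensemble procedure itself, whose details this review does not give, and the helper names and probability values are made up for illustration.

```python
def average_predictions(snapshot_probs):
    """Average per-class probabilities across snapshots for one input."""
    n = len(snapshot_probs)
    num_classes = len(snapshot_probs[0])
    return [sum(p[c] for p in snapshot_probs) / n for c in range(num_classes)]


def top1(probs):
    """Index of the highest-probability class."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical softmax outputs for one image from three snapshots (3 classes).
snapshots = [
    [0.50, 0.40, 0.10],
    [0.20, 0.70, 0.10],
    [0.45, 0.40, 0.15],
]
ensemble = average_predictions(snapshots)
print([top1(s) for s in snapshots])  # prints [0, 1, 0]
print(top1(ensemble))                # prints 1
```

In this toy case two of the three snapshots pick class 0 on their own, yet the averaged prediction picks class 1, because averaging weighs how confident each snapshot is rather than counting votes.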

This review was generated by AI and checked by a human editor.