QUICK REVIEW

[Paper Review] Large scale distributed neural network training through online distillation

Rohan Anil, Gabriel Pereyra|arXiv (Cornell University)|Apr 9, 2018

Machine Learning and Data Classification | Cited by 206

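One-line summary

Codistillation trains multiple copies of a model in parallel and adds a distillation term computed from stale predictions of the other copies. This speeds up training beyond the point where SGD stops scaling and improves reproducibility without adding test-time cost. The approach is validated on Common Crawl language modeling, ImageNet, and Criteo.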

ABSTRACT

Techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model. However, due to increased test-time cost (for ensembles) and increased complexity of the training pipeline (for distillation), these techniques are challenging to use in industrial settings. In this paper we explore a variant of distillation which is relatively straightforward to use as it does not require a complicated multi-stage setup or many new hyperparameters. Our first claim is that online distillation enables us to use extra parallelism to fit very large datasets about twice as fast. Crucially, we can still speed up training even after we have already reached the point at which additional parallelism provides no benefit for synchronous or asynchronous stochastic gradient descent. Two neural networks trained on disjoint subsets of the data can share knowledge by encouraging each model to agree with the predictions the other model would have made. These predictions can come from a stale version of the other model so they can be safely computed using weights that only rarely get transmitted. Our second claim is that online distillation is a cost-effective way to make the exact predictions of a model dramatically more reproducible. We support our claims using experiments on the Criteo Display Ad Challenge dataset, ImageNet, and the largest to-date dataset used for neural language modeling, containing $6\times 10^{11}$ tokens and based on the Common Crawl repository of web data.

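Motivation and objectives

Motivate scalable training of very large neural networks beyond the practical limits of distributed SGD.
Introduce codistillation, an online distillation variant that trains multiple models simultaneously.
Show that codistillation exploits extra parallelism to accelerate training without adding test-time cost.
Show improved reproducibility and reduced prediction churn compared with ensembles and offline distillation.
Provide practical guidance on design choices and implementation considerations for codistillation.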

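Proposed method

Train n copies of the model in parallel on locally partitioned data, without centralized gradient sharing.
Add a distillation loss term to each model's objective that encourages agreement with the average prediction of the other models.
Enable the distillation term only after an initial burn-in period, to preserve diversity among the models.
Codistillation can also be combined with standard distributed SGD by exchanging checkpoints between groups.
Consider alternatives such as a prediction server that exchanges predictions instead of weight checkpoints.
Codistillation is robust to stale predictions, so the additional communication can be kept to a minimum.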
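
The per-model objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the `distill_weight` parameter, and the choice of cross-entropy against the averaged peer distribution as the distillation term are assumptions made for the example. The peer logits stand in for predictions computed from a stale checkpoint of the other copies and are treated as constants.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(target_probs, logits):
    """Mean cross-entropy between target distributions and softmax(logits)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return float(-(target_probs * log_probs).sum(axis=-1).mean())


def codistillation_loss(logits, labels, stale_peer_logits, step,
                        burn_in_steps, distill_weight=1.0):
    """Hypothetical per-model loss: label loss plus a distillation term.

    stale_peer_logits is a list with one logits array per other model copy,
    computed from possibly outdated weights (assumption: fixed targets, so
    no gradient would flow through them in a real training framework).
    """
    num_classes = logits.shape[-1]
    one_hot = np.eye(num_classes)[labels]
    loss = cross_entropy(one_hot, logits)

    # The distillation term is switched on only after the burn-in period.
    if step >= burn_in_steps and len(stale_peer_logits) > 0:
        # Average the predictions of the other copies, then match them.
        peer_probs = np.mean([softmax(p) for p in stale_peer_logits], axis=0)
        loss += distill_weight * cross_entropy(peer_probs, logits)
    return loss
```

In a real setup each copy would reload the peer weights only occasionally and reuse them for many steps, which is where the tolerance to stale predictions keeps communication low.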

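Experimental results

Research questions

RQ1: Can online codistillation accelerate training beyond what is achievable with distributed SGD alone?
RQ2: Does codistillation maintain or improve final model accuracy compared with baselines such as SGD, label smoothing, and ensembles?
RQ3: How does the use of stale predictions affect training stability and final performance?
RQ4: Can codistillation reduce prediction churn across retrains and model versions?
RQ5: Which practical design choices, such as data partitioning and checkpoint exchange frequency, maximize the benefits of codistillation?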

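Key findings

Two-way codistillation using 128 GPUs roughly halves the number of training steps needed to reach the same validation error as the baseline, and it reaches a lower final error.
On Common Crawl language modeling, two-way codistillation approaches the training curve of a two-way ensemble and reaches equal or better accuracy in about half the steps.
On ImageNet, two-way codistillation reaches 75% accuracy in 5250 steps versus 7250 steps for the baseline, about 28% fewer training steps.
Codistillation works well even with stale predictions; even at the longest checkpoint reload interval, the degradation remains modest.
Prediction churn fell by 35% with codistillation, giving reproducibility comparable to an ensemble without increasing serving cost.
Having the codistilling models train on different data subsets yields larger gains than having them train on the same data, which suggests that useful information about diverse portions of the data is being shared.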
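
To make the churn finding concrete, the sketch below shows one way prediction churn between two retrains could be quantified. The review does not state the exact metric, so the mean absolute difference between predicted probabilities on the same examples is an assumption here, and the function names and numbers are hypothetical illustrations rather than results from the paper.

```python
import numpy as np


def prediction_churn(preds_a, preds_b):
    """Mean absolute difference between two retrains' predicted probabilities.

    Assumed metric for illustration; both inputs must score the same examples.
    """
    a = np.asarray(preds_a, dtype=float)
    b = np.asarray(preds_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both retrains must score the same examples")
    return float(np.mean(np.abs(a - b)))


def relative_reduction(baseline, treated):
    """Fractional decrease of `treated` relative to `baseline`."""
    return (baseline - treated) / baseline


# Made-up predictions from two retrains of the same model on three examples.
retrain_1 = [0.2, 0.8, 0.5]
retrain_2 = [0.3, 0.6, 0.5]
churn = prediction_churn(retrain_1, retrain_2)
```

Under this definition, a 35% reduction means the churn of the codistilled retrains is 0.65 times the churn of the baseline retrains.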

This review was generated by AI and checked by a human editor.