QUICK REVIEW

[Paper Review] How to scale distributed deep learning?

Peter Jin, Qiaochu Yuan|arXiv (Cornell University)|Nov 14, 2016

Stochastic Gradient Optimization Techniques | Cited by 53

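Summary in Brief

This paper compares synchronous and asynchronous distributed SGD for training ResNet on ImageNet, and proposes gossiping SGD, a decentralized method with no parameter server in which nodes communicate directly with one another. Gossiping SGD replaces the all-reduce collective operation with local averaging between randomly chosen pairs of nodes. Asynchronous methods (including gossiping SGD and elastic averaging) converge faster at small scale, up to about 32 nodes, whereas synchronous all-reduce SGD scales better to larger node counts, up to about 100 nodes.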

ABSTRACT

Training time on large datasets for deep neural networks is the principal workflow bottleneck in a number of important applications of deep learning, such as object classification and detection in automatic driver assistance systems (ADAS). To minimize training time, the training of a deep neural network must be scaled beyond a single machine to as many machines as possible by distributing the optimization method used for training. While a number of approaches have been proposed for distributed stochastic gradient descent (SGD), at the current time synchronous approaches to distributed SGD appear to be showing the greatest performance at large scale. Synchronous scaling of SGD suffers from the need to synchronize all processors on each gradient step and is not resilient in the face of failing or lagging processors. In asynchronous approaches using parameter servers, training is slowed by contention to the parameter server. In this paper we compare the convergence of synchronous and asynchronous SGD for training a modern ResNet network architecture on the ImageNet classification problem. We also propose an asynchronous method, gossiping SGD, that aims to retain the positive features of both systems by replacing the all-reduce collective operation of synchronous training with a gossip aggregation algorithm. We find, perhaps counterintuitively, that asynchronous SGD, including both elastic averaging and gossiping, converges faster at fewer nodes (up to about 32 nodes), whereas synchronous SGD scales better to more nodes (up to about 100 nodes).

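Motivation and Objectives

Evaluate the convergence of synchronous and asynchronous SGD in large-scale distributed deep learning.
Address the limitations of synchronous SGD at scale (sensitivity to lagging nodes) and of asynchronous SGD (the parameter-server bottleneck).
Propose and evaluate gossiping SGD as a decentralized alternative that needs no parameter server and reduces communication overhead.
Identify the most suitable distributed SGD strategy for a given number of available nodes.
Give practitioners empirical guidance on choosing between synchronous and asynchronous training, based on node count and convergence speed.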

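Proposed Method

Propose gossiping SGD, an asynchronous method that removes the centralized parameter server and replaces it with local averaging between randomly chosen pairs of nodes.
Use a two-stage update: each node first computes a local gradient step, then updates its parameters by stochastic averaging with a randomly chosen peer node.
Use a consensus-like update rule: $\theta_{i,t+1} = (1-\beta)\theta_{i,t}^\prime + \beta\theta_{j_{i,t},t}^\prime$, where $\theta_{i,t}^\prime$ is the result of the gradient step on node $i$ and $j_{i,t}$ is the randomly chosen node.
Implement a decentralized gossip protocol that avoids both synchronization barriers and a centralized bottleneck.
Compare three methods: synchronous all-reduce SGD, asynchronous elastic averaging SGD, and the proposed gossiping SGD.
Use scale-based data augmentation with fast CUDA-based image resizing to reduce I/O overhead during training.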
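
The two-stage update described above (local gradient step, then averaging with a randomly chosen peer using weight $\beta$) can be sketched as a small NumPy simulation. This is only an illustrative sketch, not the authors' implementation: it runs all nodes in lockstep rounds rather than truly asynchronously, and the toy quadratic objective, the function names `gossip_sgd_round` and `run_toy_example`, and all hyperparameter values are assumptions made for the example.

```python
import numpy as np


def gossip_sgd_round(thetas, grads, lr, beta, rng):
    """One simulated round of the gossip-style update (requires >= 2 nodes).

    thetas, grads: arrays of shape (n_nodes, dim), one row per node.
    """
    n = thetas.shape[0]
    # Stage 1: every node takes its own local gradient step (theta').
    stepped = thetas - lr * grads
    # Stage 2: every node i picks a random peer j != i ...
    offsets = rng.integers(1, n, size=n)  # values in [1, n-1]
    peers = (np.arange(n) + offsets) % n
    # ... and mixes: theta_i <- (1 - beta) * theta'_i + beta * theta'_j
    return (1.0 - beta) * stepped + beta * stepped[peers]


def run_toy_example(n_nodes=8, dim=4, steps=200, lr=0.1, beta=0.5, seed=0):
    """Hypothetical toy problem: minimize 0.5 * ||theta - target||^2 per node."""
    rng = np.random.default_rng(seed)
    target = np.ones(dim)
    thetas = rng.normal(size=(n_nodes, dim))  # each node starts elsewhere
    for _ in range(steps):
        # Noisy gradient of the quadratic, standing in for a minibatch gradient.
        noise = 0.01 * rng.normal(size=thetas.shape)
        grads = (thetas - target) + noise
        thetas = gossip_sgd_round(thetas, grads, lr, beta, rng)
    return thetas, target


if __name__ == "__main__":
    thetas, target = run_toy_example()
    print("max deviation from target:", np.max(np.abs(thetas - target)))
```

In this sketch no node ever waits on a global all-reduce or a central server; each one only reads a single peer's parameters per round, which is the property the review attributes to gossiping SGD.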

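Experimental Results

Research Questions

RQ1: How does the convergence speed of asynchronous SGD compare with that of synchronous SGD as the number of nodes varies?
RQ2: Does gossiping SGD, as a decentralized alternative, offer better convergence and scalability than a centralized parameter-server approach?
RQ3: At what scale (number of nodes) does synchronous all-reduce SGD converge faster than the asynchronous methods?
RQ4: How do large and small step sizes affect the relative performance of synchronous and asynchronous SGD?
RQ5: Does gossiping SGD converge faster than elastic averaging and all-reduce SGD at small to medium node counts?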

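Key Findings

Asynchronous SGD (including elastic averaging and gossiping SGD) converges faster than synchronous all-reduce SGD at small scale, up to about 32 nodes.
Synchronous all-reduce SGD scales to larger node counts, up to about 100 nodes, and converges faster at that scale.
Gossiping SGD, which is decentralized and has no synchronization barrier, converges faster than all-reduce SGD at small node counts.
With large step sizes the asynchronous methods converge faster, whereas with small step sizes all-reduce SGD reaches high accuracy sooner.
The trade-off between asynchronous and synchronous training depends on node count: asynchronous methods are ahead at small scale and synchronous methods at large scale.
Because the proposed gossiping SGD avoids the parameter-server bottleneck and is not held back by lagging nodes, the review describes it as practical, robust, and scalable.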

This review was generated by AI and checked by a human editor.