QUICK REVIEW

[Paper Review] Horovod: fast and easy distributed deep learning in TensorFlow

Alexander Sergeev, Mike Del Balso|arXiv (Cornell University)|Feb 15, 2018

Advanced Neural Network Applications | References: 5 | Citations: 522

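One-line summary

Horovod introduces a distributed training framework for TensorFlow built on ring-allreduce. It sharply reduces the code changes needed to move from one GPU to many and improves scaling, achieving near-linear speedup across multiple GPUs. It ships as a standalone Python package, uses NCCL for communication, and includes debugging and profiling tools.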

ABSTRACT

Training modern deep learning models requires large amounts of computation, often provided by GPUs. Scaling computation from one GPU to many can enable much faster training and research progress but entails two complications. First, the training library must support inter-GPU communication. Depending on the particular methods employed, this communication may entail anywhere from negligible to significant overhead. Second, the user must modify his or her training code to take advantage of inter-GPU communication. Depending on the training library's API, the modification required may be either significant or minimal. Existing methods for enabling multi-GPU training under the TensorFlow library entail non-negligible communication overhead and require users to heavily modify their model-building code, leading many researchers to avoid the whole mess and stick with slower single-GPU training. In this paper we introduce Horovod, an open source library that improves on both obstructions to scaling: it employs efficient inter-GPU communication via ring reduction and requires only a few lines of modification to user code, enabling faster, easier distributed training in TensorFlow. Horovod is available under the Apache 2.0 license at https://github.com/uber/horovod

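Motivation and objectives

Motivates the need for scalable distributed TensorFlow training at Uber and identifies two main obstacles: inter-GPU communication overhead and the complexity of the required user-code changes.
Proposes a ring-allreduce-based approach that addresses both scalability and simplicity.
Describes Horovod's architecture, its integration with TensorFlow/Keras, and an API design that minimizes user edits.
Presents a practical tool (Horovod Timeline) and an optimization (Tensor Fusion) that improve usability and performance.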

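Proposed method

Adopted Baidu's draft ring-allreduce implementation, then replaced it with NVIDIA NCCL to optimize communication across GPUs and across machines.
Implemented Horovod as a standalone Python package, decoupled from any specific TensorFlow release.
Extended support to models that fit on a single server, potentially spanning multiple GPUs.
Introduced a broadcast initialization hook so that all workers start from a consistent state.
Provided a minimal API surface, for example wrapping the optimizer in hvd.DistributedOptimizer and broadcasting variables from rank 0.
Built in Horovod Timeline to support profiling and debugging across nodes.
Developed Tensor Fusion, which combines small tensors into a larger buffer before allreduce to improve throughput on TCP networks.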
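To make the ring-allreduce idea concrete, the following is a minimal single-process sketch in plain Python. It is not Horovod's or NCCL's implementation: the function name `ring_allreduce`, the list-of-lists representation of per-worker gradients, and the chunking scheme are assumptions made for illustration. It shows the two phases commonly described for the algorithm, a scatter-reduce followed by an allgather, each taking N-1 steps for N workers.

```python
from typing import List


def ring_allreduce(worker_grads: List[List[float]]) -> List[List[float]]:
    """Simulate a sum ring-allreduce over N workers inside one process.

    Illustrative only: real implementations exchange chunks over the network
    or GPU interconnect; here "sending" is a list copy.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all workers must hold vectors of the same length")

    # Chunk c covers indices bounds[c] .. bounds[c + 1] - 1.
    bounds = [(c * length) // n for c in range(n + 1)]
    data = [list(g) for g in worker_grads]  # copy so inputs are not mutated

    # Phase 1 (scatter-reduce): in step s, worker r sends chunk (r - s) % n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        # Snapshot all outgoing chunks first so sends happen "simultaneously".
        sends = []
        for rank in range(n):
            c = (rank - step) % n
            sends.append((c, data[rank][bounds[c]:bounds[c + 1]]))
        for rank, (c, payload) in enumerate(sends):
            dst = (rank + 1) % n
            for i, value in enumerate(payload):
                data[dst][bounds[c] + i] += value

    # Worker r now holds the fully reduced chunk (r + 1) % n.
    # Phase 2 (allgather): reduced chunks circulate and overwrite stale copies.
    for step in range(n - 1):
        sends = []
        for rank in range(n):
            c = (rank + 1 - step) % n
            sends.append((c, data[rank][bounds[c]:bounds[c + 1]]))
        for rank, (c, payload) in enumerate(sends):
            dst = (rank + 1) % n
            data[dst][bounds[c]:bounds[c + 1]] = payload

    return data


if __name__ == "__main__":
    # Hypothetical gradients held by three workers.
    grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0], [100.0, 200.0, 300.0, 400.0]]
    summed = ring_allreduce(grads)
    averaged = [[v / len(grads) for v in row] for row in summed]
    print(summed[0], averaged[0])
```

In this sketch every worker ends up with the same summed vector. Dividing by the number of workers afterwards gives the averaged gradient that data-parallel training applies.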

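Experimental results

Research questions

RQ1: Can ring-allreduce-based communication deliver near-linear scaling for TensorFlow training across multiple GPUs and machines?
RQ2: How much code must change to convert a single-GPU TensorFlow program into a distributed Horovod program?
RQ3: How do practical tools and optimizations, such as Tensor Fusion and Timeline, improve usability and performance in real workflows?
RQ4: How does Horovod perform on TCP versus RDMA networks, and how does the effect vary across models with different parameter counts?
RQ5: How does Horovod compare with standard distributed TensorFlow in efficiency and resource usage?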
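The scaling questions above rest on a notion of scaling efficiency. A minimal sketch follows, assuming the usual definition: measured multi-GPU throughput divided by the ideal throughput, where the ideal is the single-GPU rate multiplied by the number of GPUs. The function name `scaling_efficiency` and the throughput numbers in the usage example are hypothetical and are not taken from the paper.

```python
def scaling_efficiency(single_gpu_rate: float, multi_gpu_rate: float, num_gpus: int) -> float:
    """Return measured throughput as a fraction of ideal linear scaling.

    Rates can be in any consistent unit, for example images per second.
    """
    if single_gpu_rate <= 0 or num_gpus < 1:
        raise ValueError("single_gpu_rate must be positive and num_gpus at least 1")
    ideal_rate = single_gpu_rate * num_gpus  # perfect linear scaling
    return multi_gpu_rate / ideal_rate


if __name__ == "__main__":
    # Hypothetical measurements: 100 images/s on 1 GPU, 11264 images/s on 128 GPUs.
    efficiency = scaling_efficiency(100.0, 11264.0, 128)
    print(f"scaling efficiency: {efficiency:.0%}")
```

An efficiency of 1.0 corresponds to perfectly linear scaling; values below 1.0 indicate time lost to communication or other overhead.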

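Key findings

Horovod scales markedly better than standard distributed TensorFlow, with up to 88% scaling efficiency reported in benchmarks.
On multiple GPUs, training with Horovod can be nearly twice as fast as with standard distributed TensorFlow.
RDMA networking yields modest gains for some models (an additional 3-4%) and can push scaling efficiency above 90% for certain architectures.
Tensor Fusion reduces communication overhead and yields up to a 65% improvement for models with many small tensor operations.
Horovod reduces setup and integration effort to a few lines of code changes, easing adoption across teams.
Horovod Timeline provides high-level, browser-accessible profiling that supports debugging and performance analysis.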
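The Tensor Fusion finding can be illustrated with a small sketch of the batching idea: pack several small tensors into one buffer, run a single allreduce on the buffer, then slice the result back into per-tensor outputs. This is a simplified illustration and not Horovod's implementation. The names `fused_allreduce` and `buffer_capacity`, the capacity measured in elements, and the stand-in `allreduce` callable are all assumptions.

```python
from typing import Callable, Dict, List

Vector = List[float]


def fused_allreduce(
    tensors: Dict[str, Vector],
    allreduce: Callable[[Vector], Vector],
    buffer_capacity: int,
) -> Dict[str, Vector]:
    """Reduce many small tensors with as few allreduce calls as the buffer allows.

    `allreduce` is any function that reduces one flat vector across workers.
    A tensor larger than `buffer_capacity` is sent in a call of its own.
    """
    results: Dict[str, Vector] = {}
    batch: List[str] = []  # names of tensors waiting in the current buffer
    used = 0               # elements currently occupied in the buffer

    def flush() -> None:
        nonlocal batch, used
        if not batch:
            return
        # Copy the pending tensors into one contiguous fusion buffer.
        buffer: Vector = []
        for name in batch:
            buffer.extend(tensors[name])
        reduced = allreduce(buffer)  # one communication call for the whole batch
        # Copy the reduced data back out into per-tensor results.
        offset = 0
        for name in batch:
            size = len(tensors[name])
            results[name] = reduced[offset:offset + size]
            offset += size
        batch, used = [], 0

    for name, values in tensors.items():
        if batch and used + len(values) > buffer_capacity:
            flush()
        batch.append(name)
        used += len(values)
    flush()
    return results


if __name__ == "__main__":
    calls = []

    def fake_allreduce(buffer: Vector) -> Vector:
        # Stand-in for a real allreduce: pretends two workers hold identical data.
        calls.append(len(buffer))
        return [2 * v for v in buffer]

    grads = {"a": [1.0, 2.0], "b": [3.0], "c": [4.0, 5.0, 6.0], "d": [7.0]}
    fused = fused_allreduce(grads, fake_allreduce, buffer_capacity=4)
    print(fused, "allreduce calls:", len(calls))
```

The design point is that per-call latency, not bandwidth, tends to dominate when tensors are small, so fewer and larger calls reduce the total communication cost.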

This review was generated by AI and checked by a human editor.