QUICK REVIEW

[Paper Review] Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Lianmin Zheng, Zhuohan Li|arXiv (Cornell University)|Jan 28, 2022

Parallel Computing and Optimization Techniques | Cited by 75

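One-Line Summary

Alpa accelerates distributed deep learning by automatically generating hierarchical execution plans that combine intra- and inter-operator parallelism. Its plans match or outperform hand-tuned systems and generalize to heterogeneous models and clusters.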

ABSTRACT

Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations. They do not suffice to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms. Based on it, Alpa constructs a new hierarchical space for massive model-parallel execution plans. Alpa designs a number of compilation passes to automatically derive efficient parallel execution plans at each parallelism level. Alpa implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. Our evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems even on models they are designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans. Alpa's source code is publicly available at https://github.com/alpa-projects/alpa

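Motivation and Objectives

Automate the parallelization of large DL models by unifying data, operator, and pipeline parallelism.
Explore a hierarchical plan space that separates intra- and inter-operator parallelism.
Develop compilation passes and a runtime that generate and execute efficient distributed plans.
Generalize to heterogeneous architectures and to models that have no manually designed plan.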

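Proposed Method

Define intra-operator and inter-operator parallelism as two hierarchical levels.
Construct a two-level space of parallel execution plans and formulate an optimization problem at each level.
Intra-op: model the problem as an integer linear program (ILP) that minimizes execution cost on a device mesh, accounting for sharding specs and resharding costs.
Inter-op: use a DP-based planner that partitions the model into stages and the device cluster into meshes so as to minimize end-to-end pipeline latency.
Implement three compilation passes: intra-op optimization, inter-op optimization, and runtime orchestration.
Provide a Python API for annotating the functions to be parallelized; annotated functions are compiled automatically into parallel versions.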
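
The intra-op objective described above (per-operator execution cost plus resharding cost on each edge of the graph) can be illustrated on a toy graph. The sketch below is not Alpa's implementation: it swaps the ILP solver for exhaustive enumeration, which is only feasible for a handful of operators. The operator names, strategy names, and cost numbers are hypothetical.

```python
from itertools import product


def best_sharding(node_costs, edges, reshard_cost):
    """Pick one sharding strategy per operator to minimize total cost.

    node_costs: {op_name: {strategy_name: cost}} (hypothetical cost table)
    edges: list of (producer_op, consumer_op) pairs
    reshard_cost: function (producer_strategy, consumer_strategy) -> cost
    Exhaustive search stands in for the ILP solver used in the paper.
    """
    ops = list(node_costs)
    best_cost, best_choice = float("inf"), None
    # Enumerate every combination of one strategy per operator.
    for combo in product(*(node_costs[op] for op in ops)):
        choice = dict(zip(ops, combo))
        # Cost of running each operator under its chosen strategy.
        cost = sum(node_costs[op][choice[op]] for op in ops)
        # Cost of converting layouts along each producer -> consumer edge.
        cost += sum(reshard_cost(choice[u], choice[v]) for u, v in edges)
        if cost < best_cost:
            best_cost, best_choice = cost, choice
    return best_cost, best_choice


# Two chained matmuls with made-up costs; switching layouts costs 5.0.
toy_costs = {
    "matmul1": {"row": 2.0, "col": 3.0},
    "matmul2": {"row": 4.0, "col": 1.0},
}
toy_edges = [("matmul1", "matmul2")]
cost, choice = best_sharding(
    toy_costs, toy_edges, lambda a, b: 0.0 if a == b else 5.0
)
print(cost, choice)
```

In this toy case the cheapest strategy for each operator in isolation ("row" then "col") is not the best overall plan, because the resharding cost on the edge outweighs the savings. That coupling between neighboring operators is what makes a joint formulation such as an ILP necessary.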

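Experimental Results

Research Questions

RQ1: Can efficient execution plans that combine intra- and inter-operator parallelism be generated automatically for distributed DL?
RQ2: Can the hierarchical space and optimization passes produce plans that match or exceed hand-tuned systems across varied models and clusters?
RQ3: How well does the approach generalize to heterogeneous architectures and to models without manually designed plans?
RQ4: What are the trade-offs among communication, computation, and idle time between intra- and inter-operator parallelism?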

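Key Findings

Alpa constructs a two-level space of parallel execution plans that unifies intra- and inter-operator parallelism.
The ILP formulation efficiently optimizes intra-op plans on a device mesh, and the DP-based inter-op planner uses the intra-op costs to assign stages to meshes so as to minimize end-to-end pipeline latency.
Alpa's runtime orchestrates cross-mesh communication and supports pipeline schedules that enable end-to-end execution on distributed GPUs.
In the evaluation, Alpa matches or outperforms hand-tuned systems on GPT and GShard MoE models, and achieves notable scaling on Wide-ResNet without a manually designed plan.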
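
The inter-op planning idea in the findings above can be sketched as a small dynamic program. This is a simplified illustration, not Alpa's planner. It assumes the model is a chain of layers, that a stage's cost comes from a hypothetical formula (work divided by device count, plus a linear communication penalty) rather than from the intra-op pass, and that pipeline latency is the sum of stage costs plus (number of microbatches - 1) times the slowest stage. The function name and all parameters are illustrative.

```python
from functools import lru_cache
from itertools import accumulate


def plan_pipeline(layer_costs, num_devices, num_microbatches, comm_overhead=0.0):
    """Return (latency, stages); each stage is (first_layer, end_layer, devices)."""
    n = len(layer_costs)
    prefix = [0.0] + list(accumulate(layer_costs))

    def stage_cost(i, j, d):
        # Hypothetical stand-in for the intra-op cost of layers i..j-1 on d devices.
        return (prefix[j] - prefix[i]) / d + comm_overhead * (d - 1)

    # Every achievable stage cost is a candidate bound on the slowest stage.
    candidates = sorted({
        stage_cost(i, j, d)
        for i in range(n)
        for j in range(i + 1, n + 1)
        for d in range(1, num_devices + 1)
    })

    best_latency, best_stages = float("inf"), None
    for t_max in candidates:

        @lru_cache(maxsize=None)
        def solve(i, d):
            # Min total stage cost for layers i..n-1 using exactly d devices,
            # with every stage cost bounded by t_max.
            if i == n:
                return (0.0, ()) if d == 0 else (float("inf"), ())
            best = (float("inf"), ())
            for j in range(i + 1, n + 1):
                for k in range(1, d + 1):
                    c = stage_cost(i, j, k)
                    if c > t_max:
                        continue
                    rest, plan = solve(j, d - k)
                    if c + rest < best[0]:
                        best = (c + rest, ((i, j, k),) + plan)
            return best

        total, plan = solve(0, num_devices)
        latency = total + (num_microbatches - 1) * t_max
        if latency < best_latency:
            best_latency, best_stages = latency, list(plan)
    return best_latency, best_stages


# Four equal layers, four devices, nine microbatches, made-up overhead.
print(plan_pipeline([4, 4, 4, 4], 4, 9, comm_overhead=1.0))
```

With a nonzero communication penalty and many microbatches, this toy planner prefers four single-device stages. With one microbatch and no penalty, it puts all layers in a single stage spread over all devices. This mirrors the trade-off between communication, computation, and pipeline idle time that the review raises in RQ4.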

This review was generated by AI and checked by a human editor.