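[Paper Review] Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
Alpa automatically generates hierarchical execution plans that combine intra- and inter-operator parallelism to accelerate distributed deep learning. It matches or outperforms hand-tuned systems and generalizes to heterogeneous models and clusters.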
Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations. They do not suffice to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms. Based on it, Alpa constructs a new hierarchical space for massive model-parallel execution plans. Alpa designs a number of compilation passes to automatically derive efficient parallel execution plans at each parallelism level. Alpa implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. Our evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems even on models they are designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans. Alpa's source code is publicly available at https://github.com/alpa-projects/alpa
Research Motivation and Goals
- Automate parallelization of large DL models by unifying data, operator, and pipeline parallelism.
- Explore a hierarchical space of plans that separates intra- and inter-operator parallelism.
- Develop compilation passes and runtime to generate and execute efficient distributed plans.
- Enable generalization to heterogeneous architectures and models without manually designed plans.
Proposed Method
- Define intra-operator and inter-operator parallelism as two hierarchical levels.
- Construct a two-level parallel execution plan space and formulate optimization problems for each level.
- Intra-op: model the problem as an ILP to minimize execution cost on a device mesh, considering sharding specs and resharding costs.
- Inter-op: slice the model into stages and partition device clusters into meshes, using dynamic programming (DP) to minimize end-to-end pipeline latency.
- Implement three compilation passes: intra-op optimization, inter-op optimization, and runtime orchestration.
- Provide a Python API that annotates functions for parallelization and automatically compiles to a parallel version.
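The intra-op step above can be illustrated with a toy version of the objective: pick one sharding strategy per operator so that the sum of per-operator costs plus the resharding costs on each edge is minimal. The sketch below is not Alpa's implementation. It swaps the ILP solver for exhaustive search, which is only feasible on tiny graphs, and the function name `best_sharding`, the operator names, and all cost numbers are hypothetical.

```python
from itertools import product


def best_sharding(node_costs, edge_costs):
    """Exhaustively minimise node costs plus resharding costs.

    node_costs[v][i]: compute + communication cost of operator v
        when it uses sharding strategy i.
    edge_costs[(u, v)][i][j]: resharding cost on edge u -> v when
        u uses strategy i and v uses strategy j.
    """
    names = list(node_costs)
    best_total, best_choice = float("inf"), None
    # Enumerate every assignment of one strategy index per operator.
    for combo in product(*(range(len(node_costs[v])) for v in names)):
        pick = dict(zip(names, combo))
        total = sum(node_costs[v][pick[v]] for v in names)
        total += sum(r[pick[u]][pick[v]] for (u, v), r in edge_costs.items())
        if total < best_total:
            best_total, best_choice = total, pick
    return best_total, best_choice


# Hypothetical toy graph: two chained matmuls, two strategies each.
node_costs = {"matmul1": [4.0, 1.0], "matmul2": [2.0, 3.0]}
# Switching strategy between the two operators costs 5.0 (made-up number).
edge_costs = {("matmul1", "matmul2"): [[0.0, 5.0], [5.0, 0.0]]}

total, choice = best_sharding(node_costs, edge_costs)
print(total, choice)
```

In this toy case, choosing the cheapest strategy for each operator independently (strategy 1 for `matmul1`, strategy 0 for `matmul2`) costs 8.0 once resharding is counted, while the joint optimum keeps both operators on strategy 1 for a total of 4.0. That interaction between neighbouring operators is why the problem is posed as a joint optimization rather than a per-operator greedy choice.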
Experimental Results
Research Questions
- RQ1: How can we automatically generate efficient parallel execution plans that combine intra- and inter-operator parallelism for distributed DL?
- RQ2: Can a hierarchical space and optimization passes yield plans that match or exceed hand-tuned systems across varied models and clusters?
- RQ3: How well does the approach generalize to heterogeneous architectures and models without manually designed plans?
- RQ4: What are the trade-offs between intra- and inter-operator parallelism in terms of communication, compute, and idle time?
Key Results
- Alpa creates a two-level parallel execution plan space that unifies intra- and inter-operator parallelism.
- An ILP formulation efficiently optimizes intra-op plans on a device mesh, and a DP-based inter-op planner assigns stages to meshes while leveraging intra-op costs.
- Alpa’s runtime orchestrates cross-mesh communication and supports pipeline schedules to achieve end-to-end execution on distributed GPUs.
- Empirical evaluation shows Alpa matches or outperforms hand-tuned systems on GPT models and GShard MoE models, and achieves notable scaling on Wide-ResNet without manual plans.
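The DP-based inter-op planner mentioned above can be sketched in simplified form. With B microbatches and stage latencies t_1, ..., t_S, the pipeline latency is the sum of all t_i plus (B - 1) times the slowest stage. The sketch enumerates candidate values for the slowest-stage latency `t_max` and, for each, uses dynamic programming to find the cheapest way to split the layers into stages and hand each stage a share of the devices. Several things here are assumptions for illustration only: the function name `plan_pipeline`, the 1-D device count standing in for 2-D submeshes, and the `toy_latency` cost model, which replaces the per-stage costs that Alpa obtains from its intra-op pass.

```python
from functools import lru_cache


def plan_pipeline(layer_costs, num_devices, num_microbatches, stage_latency):
    """Return (pipeline latency, stages); each stage is (start, end, devices)."""
    n = len(layer_costs)
    # Every latency any contiguous stage could have is a candidate t_max.
    candidates = sorted({
        stage_latency(layer_costs[i:j], d)
        for i in range(n)
        for j in range(i + 1, n + 1)
        for d in range(1, num_devices + 1)
    })
    best = (float("inf"), ())
    for t_max in candidates:

        @lru_cache(maxsize=None)
        def solve(k, d):
            # Min summed stage latency for layers k..n-1 using exactly
            # d devices, with every stage no slower than t_max.
            if k == n:
                return (0.0, ()) if d == 0 else (float("inf"), ())
            result = (float("inf"), ())
            for j in range(k + 1, n + 1):      # stage covers layers k..j-1
                for m in range(1, d + 1):      # stage gets m devices
                    t = stage_latency(layer_costs[k:j], m)
                    if t > t_max:
                        continue
                    rest, stages = solve(j, d - m)
                    if t + rest < result[0]:
                        result = (t + rest, ((k, j, m),) + stages)
            return result

        total, stages = solve(0, num_devices)
        latency = total + (num_microbatches - 1) * t_max
        if latency < best[0]:
            best = (latency, stages)
    return best


def toy_latency(costs, m):
    # Hypothetical cost model: ideal speed-up plus a made-up
    # communication overhead of 0.5 per extra device.
    return sum(costs) / m + 0.5 * (m - 1)


latency, stages = plan_pipeline([4.0, 4.0, 4.0, 4.0], 4, 8, toy_latency)
print(latency, stages)
```

On this toy input the planner picks two stages of two layers each, with two devices per stage, for a latency of 40.5. Putting everything in one four-device stage or using four single-device stages both come out at 44.0 under the same made-up cost model, which illustrates the trade-off between intra-op communication overhead and pipeline imbalance that the planner has to weigh.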
This review was generated by AI and reviewed by a human editor.