QUICK REVIEW

[Paper Review] Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Lianmin Zheng, Zhuohan Li|arXiv (Cornell University)|Jan 28, 2022

Parallel Computing and Optimization Techniques|Cited by 75

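ONE-SENTENCE SUMMARY

Alpa automatically generates hierarchical execution plans that combine intra-operator and inter-operator parallelism to speed up distributed deep learning, matching or outperforming hand-tuned systems while generalizing to heterogeneous models and clusters.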

ABSTRACT

Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations. They do not suffice to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms. Based on it, Alpa constructs a new hierarchical space for massive model-parallel execution plans. Alpa designs a number of compilation passes to automatically derive efficient parallel execution plans at each parallelism level. Alpa implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. Our evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems even on models they are designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans. Alpa's source code is publicly available at https://github.com/alpa-projects/alpa

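MOTIVATION AND GOALS

Automate the parallelization of large DL models by unifying data, operator, and pipeline parallelism.
Explore a hierarchical plan space that separates intra-operator from inter-operator parallelism.
Develop compilation passes and a runtime that generate and execute efficient distributed plans.
Generalize to heterogeneous architectures and to models that have no manually designed plan.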

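PROPOSED METHOD

Define intra-operator and inter-operator parallelism as two hierarchical levels.
Construct a two-level space of parallel execution plans and formalize an optimization problem for each level.
Intra-op: formulate the problem as an integer linear program (ILP) that minimizes execution cost on a device mesh, accounting for sharding specs and resharding costs.
Inter-op: slice the model into stages and the device cluster into meshes, using dynamic programming (DP) to minimize end-to-end pipeline latency.
Implement three compilation passes: intra-op optimization, inter-op optimization, and runtime orchestration.
Provide a Python API for annotating functions to be parallelized, which are then compiled automatically into parallel versions.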
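
The intra-op objective described above (per-operator execution cost plus resharding cost between neighboring operators) can be illustrated on a toy problem. The following is a minimal sketch, not Alpa's implementation: it assumes a simple chain of operators, replaces the ILP solver with exhaustive search, and uses made-up cost numbers. The names `best_sharding`, `node_costs`, and `reshard_costs` are hypothetical.

```python
from itertools import product


def best_sharding(node_costs, reshard_costs):
    """Pick one sharding strategy per operator in a chain to minimize total cost.

    node_costs[v][i]: cost of running operator v under strategy i.
    reshard_costs[v][i][j]: cost of converting the output of operator v
        (strategy i) into the layout needed by operator v + 1 (strategy j).

    Exhaustive search stands in for the ILP solver; this is only feasible
    for tiny examples.
    """
    n = len(node_costs)
    best_total, best_choice = float("inf"), None
    for choice in product(*(range(len(costs)) for costs in node_costs)):
        total = sum(node_costs[v][choice[v]] for v in range(n))
        total += sum(
            reshard_costs[v][choice[v]][choice[v + 1]] for v in range(n - 1)
        )
        if total < best_total:
            best_total, best_choice = total, choice
    return best_total, best_choice


# Hypothetical costs: three operators, two candidate strategies each.
node_costs = [[4.0, 2.0], [3.0, 4.0], [1.0, 5.0]]
# Keeping the same layout is free; switching layouts costs 1.0.
reshard_costs = [
    [[0.0, 1.0], [1.0, 0.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]

total, choice = best_sharding(node_costs, reshard_costs)
print(total, choice)
```

In this toy instance the cheapest plan switches layout once, because the saving on the first operator outweighs the resharding cost. That is the kind of trade-off the ILP formulation is meant to capture across a whole computational graph.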

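EXPERIMENTAL RESULTS

Research Questions

RQ1: How can we automatically generate efficient parallel execution plans that combine intra- and inter-operator parallelism for distributed DL?
RQ2: Can a hierarchical plan space with per-level optimization passes produce plans that match or outperform hand-tuned systems across different models and clusters?
RQ3: How well does the approach generalize to heterogeneous architectures and models without manually designed plans?
RQ4: What are the trade-offs between intra- and inter-operator parallelism in terms of communication, computation, and idle time?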

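Key Findings

Alpa unifies intra- and inter-operator parallelism in a single two-level space of parallel execution plans.
An ILP formulation efficiently optimizes the intra-op plan on a device mesh, and the DP-based inter-op planner uses the resulting intra-op costs when assigning stages to meshes.
Alpa's runtime orchestrates cross-mesh communication and supports pipeline scheduling, enabling end-to-end execution on distributed GPUs.
The empirical evaluation shows that Alpa matches or outperforms hand-tuned systems on GPT and GShard MoE models, and scales well on Wide-ResNet without any manually designed plan.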
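
To make the interaction between the two levels concrete, here is a minimal sketch of a DP-based inter-op planner under simplifying assumptions: layers form a chain, each stage gets a submesh of 1, 2, or 4 devices, and the pipeline latency for B microbatches is the sum of stage latencies plus (B - 1) times the slowest stage latency. The function `stage_latency` is a hypothetical stand-in for the cost the intra-op pass would report; `plan_pipeline` and all numbers are illustrative, not Alpa's code.

```python
from functools import lru_cache


def plan_pipeline(layer_costs, num_devices, num_microbatches,
                  mesh_sizes=(1, 2, 4), comm_overhead=0.5):
    """Slice a chain of layers into stages and assign each stage a submesh.

    Minimizes sum(stage latencies) + (B - 1) * max(stage latency).
    Returns (latency, stages), where each stage is (first_layer, end_layer, devices).
    """
    n = len(layer_costs)
    prefix = [0.0]
    for cost in layer_costs:
        prefix.append(prefix[-1] + cost)

    def stage_latency(i, j, d):
        # Hypothetical intra-op cost: ideal speedup on d devices plus a
        # communication overhead that grows with the submesh size.
        return (prefix[j] - prefix[i]) / d + comm_overhead * (d - 1)

    # Every achievable stage latency is a candidate for the bottleneck bound.
    candidates = sorted({
        stage_latency(i, j, d)
        for i in range(n) for j in range(i + 1, n + 1)
        for d in mesh_sizes if d <= num_devices
    })

    best = (float("inf"), ())
    for t_max in candidates:

        @lru_cache(maxsize=None)
        def solve(i, devices_left):
            # Minimum summed latency for layers i..n-1 using exactly
            # devices_left devices, with every stage latency <= t_max.
            if i == n:
                return (0.0, ()) if devices_left == 0 else (float("inf"), ())
            result = (float("inf"), ())
            for j in range(i + 1, n + 1):
                for d in mesh_sizes:
                    if d > devices_left:
                        continue
                    t = stage_latency(i, j, d)
                    if t > t_max:
                        continue
                    rest, stages = solve(j, devices_left - d)
                    if t + rest < result[0]:
                        result = (t + rest, ((i, j, d),) + stages)
            return result

        total, stages = solve(0, num_devices)
        latency = total + (num_microbatches - 1) * t_max
        if latency < best[0]:
            best = (latency, stages)
    return best


latency, stages = plan_pipeline([4.0, 4.0, 4.0, 4.0],
                                num_devices=4, num_microbatches=8)
print(latency, stages)
```

With these made-up costs the planner prefers two balanced stages on two devices each over either a single four-device stage or four single-device stages. Enumerating the bottleneck bound `t_max` and running a DP for each value keeps the search exact, since the bound is tight whenever `t_max` equals the slowest stage of the optimal plan.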

This review was generated by AI and reviewed by human editors.