QUICK REVIEW

[Paper Review] HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism

Jay Park, Gyeongchan Yun|arXiv (Cornell University)|May 28, 2020

Advanced Neural Network Applications | Cited by 41

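One-Sentence Summary

HetPipe combines pipelined model parallelism with data parallelism to train large DNNs on heterogeneous GPU clusters, including clusters with whimpy GPUs; in the reported experiments, models converge up to 49% faster than with the state-of-the-art DP technique.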

ABSTRACT

Deep Neural Network (DNN) models have continuously been growing in size in order to improve the accuracy and quality of the models. Moreover, for training of large DNN models, the use of heterogeneous GPUs is inevitable due to the short release cycle of new GPU architectures. In this paper, we investigate how to enable training of large DNN models on a heterogeneous GPU cluster that possibly includes whimpy GPUs that, as a standalone, could not be used for training. We present a DNN training system, HetPipe (Heterogeneous Pipeline), that integrates pipelined model parallelism (PMP) with data parallelism (DP). In HetPipe, a group of multiple GPUs, called a virtual worker, processes minibatches in a pipelined manner, and multiple such virtual workers employ data parallelism for higher performance. We also propose a novel parameter synchronization model, which we refer to as Wave Synchronous Parallel (WSP) to accommodate both PMP and DP for virtual workers, and provide convergence proof of WSP. Our experimental results on a given heterogeneous setting show that with HetPipe, DNN models converge up to 49% faster compared to the state-of-the-art DP technique.

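Motivation and Goals

Enable training of large DNNs on heterogeneous GPU clusters that mix GPU types, possibly including whimpy GPUs.
Use the GPUs efficiently by applying PMP within each virtual worker and DP across virtual workers.
Provide a synchronization model with a convergence guarantee that suits heterogeneous pipelined training.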

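Proposed Method

Introduce virtual workers, each made up of multiple GPUs, so that data parallelism can be applied despite heterogeneity.
Split the DNN model into k partitions and run PMP within each virtual worker, so that the partitions form a pipeline.
Propose Wave Synchronous Parallel (WSP), a synchronization model that aggregates updates per wave and comes with a convergence guarantee.
Use parameter servers for global weight synchronization, with bounded global staleness.
Give a convergence proof for WSP.
Implement HetPipe by modifying TensorFlow and evaluate it on a heterogeneous cluster with four different types of GPUs.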
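
The partitioning step above can be illustrated with a minimal sketch. It assumes the GPU order inside a virtual worker is fixed, that each GPU gets one contiguous group of layers, and that the goal is to minimize the slowest pipeline stage subject to each GPU's memory limit. The function name, the cost and memory numbers, and the brute-force search are all hypothetical and chosen for illustration; this is not the solver used in the paper.

```python
from itertools import combinations


def partition_layers(layer_cost, layer_mem, gpu_speed, gpu_mem):
    """Split layers into len(gpu_speed) contiguous stages, one per GPU.

    Returns (bottleneck_time, stage_boundaries) for the split whose slowest
    stage is fastest, or None if no split fits within the memory limits.
    All inputs are illustrative assumptions, not measurements.
    """
    n, k = len(layer_cost), len(gpu_speed)
    best = None
    # Choose k-1 cut points; stage i covers layers [bounds[i], bounds[i+1]).
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        times = []
        feasible = True
        for i in range(k):
            lo, hi = bounds[i], bounds[i + 1]
            if sum(layer_mem[lo:hi]) > gpu_mem[i]:
                feasible = False  # this stage does not fit on GPU i
                break
            # A faster GPU (larger speed factor) finishes the same work sooner.
            times.append(sum(layer_cost[lo:hi]) / gpu_speed[i])
        if feasible:
            bottleneck = max(times)
            if best is None or bottleneck < best[0]:
                best = (bottleneck, bounds)
    return best


# Hypothetical 4-layer model on one fast, large GPU and one slow, small GPU.
result = partition_layers(
    layer_cost=[4, 4, 2, 2],
    layer_mem=[3, 3, 1, 1],
    gpu_speed=[2.0, 1.0],
    gpu_mem=[8, 4],
)
print(result)
```

In this toy setting the faster GPU takes the two heavy layers and the slower GPU takes the two light ones, so both stages take the same time. Exhaustive search is only practical for small k and few layers; a real system would need a proper optimizer.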

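Experimental Results

Research Questions

RQ1: Can large DNN models be trained efficiently on heterogeneous GPU clusters by combining PMP with DP?
RQ2: How should GPU resources be allocated and the model partitioned in HetPipe to maximize pipeline performance?
RQ3: Does Wave Synchronous Parallel ensure convergence under heterogeneity and pipelined execution?
RQ4: What performance gains does HetPipe achieve over the state-of-the-art DP approach, Horovod with AllReduce?
RQ5: How does HetPipe handle global and local staleness in a heterogeneous DP+PMP setting?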

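Key Findings

In the paper's heterogeneous cluster setting, HetPipe converges up to 49% faster (VGG-19) and up to 39% faster (ResNet-152) than Horovod-based DP.
By forming virtual workers, HetPipe makes it possible to train large models that a single whimpy GPU cannot hold.
PMP within virtual workers and DP across virtual workers improve the utilization of heterogeneous GPUs.
WSP provides a convergence guarantee for the combined PMP and DP setting, with bounded staleness.
Aggregating updates per wave rather than per minibatch reduces the frequency of global synchronization and therefore the communication overhead.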
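
The last two findings can be made concrete with a toy sketch of per-wave aggregation and a bounded-staleness gate. The class name, the method names, and the gate rule (a virtual worker may be at most D waves ahead of the slowest one) are assumptions made for illustration; the paper's exact staleness accounting and its parameter-server protocol are not reproduced here.

```python
class WaveSyncServer:
    """Toy parameter server that counts the waves pushed by each virtual worker."""

    def __init__(self, num_workers, staleness_bound, weights):
        self.clock = [0] * num_workers  # waves pushed so far, per virtual worker
        self.D = staleness_bound        # assumed bound on how far a worker may run ahead
        self.weights = list(weights)
        self.pushes = 0                 # number of global synchronizations

    def can_start_wave(self, worker):
        # Simplified gate: stay within D waves of the slowest virtual worker.
        return self.clock[worker] - min(self.clock) <= self.D

    def push_wave(self, worker, minibatch_updates):
        # Sum the updates of every minibatch in the wave, then apply them in one push.
        aggregated = [sum(column) for column in zip(*minibatch_updates)]
        self.weights = [w + u for w, u in zip(self.weights, aggregated)]
        self.clock[worker] += 1
        self.pushes += 1


# Two virtual workers, staleness bound D = 1, and a two-element weight vector.
server = WaveSyncServer(num_workers=2, staleness_bound=1, weights=[0.0, 0.0])

# Worker 0 finishes a wave of four minibatches and pushes once instead of four times.
wave = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
server.push_wave(0, wave)
weights_after_first_wave = list(server.weights)

server.push_wave(0, [[1.0, 1.0]])
blocked = not server.can_start_wave(0)   # worker 0 is now 2 waves ahead of worker 1

server.push_wave(1, [[0.0, 0.0]])
unblocked = server.can_start_wave(0)     # the gap is back within the bound
```

With four minibatches per wave, the first wave costs one global synchronization rather than four, which is the source of the reduced communication overhead. The gate shows how a bound on staleness keeps a fast virtual worker from running arbitrarily far ahead of a slow one.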

This review was generated by AI and reviewed by human editors.