QUICK REVIEW

[Paper Review] Beyond Data and Model Parallelism for Deep Neural Networks

Zhihao Jia, Matei Zaharia|arXiv (Cornell University)|Jul 14, 2018

Advanced Neural Network Applications | References: 31 | Cited by: 147

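One-Sentence Summary

FlexFlow defines a broader search space for DNN parallelization called SOAP (Sample-Operation-Attribute-Parameter), uses a fast execution simulator combined with MCMC search to find efficient strategies, and improves training throughput by up to 3.8x over existing approaches.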

ABSTRACT

The computational requirements for training deep neural networks (DNNs) have grown to the point that it is now standard practice to parallelize training. Existing deep learning systems commonly use data or model parallelism, but unfortunately, these strategies often result in suboptimal parallelization performance. In this paper, we define a more comprehensive search space of parallelization strategies for DNNs called SOAP, which includes strategies to parallelize a DNN in the Sample, Operation, Attribute, and Parameter dimensions. We also propose FlexFlow, a deep learning framework that uses guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel machine. To accelerate this search, FlexFlow introduces a novel execution simulator that can accurately predict a parallelization strategy's performance and is three orders of magnitude faster than prior approaches that have to execute each strategy. We evaluate FlexFlow with six real-world DNN benchmarks on two GPU clusters and show that FlexFlow can increase training throughput by up to 3.8x over state-of-the-art approaches, even when including its search time, and also improves scalability.

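Motivation and Objectives

Motivate the need for more comprehensive parallelization that goes beyond data and model parallelism.
Formalize a broader SOAP search space covering the Sample, Operation, Attribute, and Parameter dimensions.
Develop a fast execution simulator that predicts performance and guides optimization.
Propose FlexFlow, a framework that automatically discovers and executes fast parallelization strategies.
Demonstrate throughput and scalability improvements on real-world DNN benchmarks.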
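
To make the SOAP space concrete, the sketch below enumerates parallelization configurations for a single operation. It is an illustration, not FlexFlow's API: the function names (`enumerate_configs`, `task_slices`), the dimension names, and the tensor sizes are hypothetical. It assumes that a configuration assigns one degree of parallelism to each parallelizable dimension, that partitions are equal-sized, and that the number of tasks (the product of the degrees) may not exceed the number of devices.

```python
from itertools import product
from math import prod


def divisors(n):
    """All positive integers that divide n evenly."""
    return [d for d in range(1, n + 1) if n % d == 0]


def enumerate_configs(dim_sizes, num_devices):
    """List every assignment of a degree of parallelism to each dimension
    such that each degree divides its dimension evenly and the total
    number of tasks fits on the available devices."""
    names = list(dim_sizes)
    choices = [divisors(dim_sizes[n]) for n in names]
    configs = []
    for degrees in product(*choices):
        if prod(degrees) <= num_devices:
            configs.append(dict(zip(names, degrees)))
    return configs


def task_slices(dim_sizes, config):
    """For each task, return the half-open index range it owns per dimension."""
    names = list(dim_sizes)
    per_dim = []
    for n in names:
        step = dim_sizes[n] // config[n]
        per_dim.append([(i * step, (i + 1) * step) for i in range(config[n])])
    return [dict(zip(names, combo)) for combo in product(*per_dim)]


# Hypothetical output tensor of a fully connected layer:
# 4 samples (Sample dimension) x 2 output channels (Parameter dimension).
DIMS = {"sample": 4, "channel": 2}
CONFIGS = enumerate_configs(DIMS, num_devices=4)
SLICES = task_slices(DIMS, {"sample": 2, "channel": 2})
```

Pure data parallelism corresponds to configurations that split only the `sample` dimension. Allowing the other dimensions to be split as well is what enlarges the search space, even for one small operation.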

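Proposed Method

Define the SOAP search space (Sample, Operation, Attribute, Parameter) for parallelizing a DNN across devices.
Develop a fast execution simulator that predicts performance quickly and with low variance, making an extensive search practical.
Use a Markov chain Monte Carlo (MCMC) optimizer to explore SOAP strategies based on their simulated performance.
Implement full and incremental (delta) simulation algorithms to evaluate strategy changes efficiently.
Build a distributed runtime on Legion to execute the discovered parallelization strategies.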
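
The simulator-guided MCMC search described above can be sketched as a Metropolis-style loop. This is a minimal illustration under stated assumptions, not FlexFlow's implementation. `simulate_cost` is a toy stand-in for the execution simulator, and the cost model, the candidate degrees, the operation names, and the `beta` value are all hypothetical. A proposal changes the configuration of one randomly chosen operation and is accepted with probability min(1, exp(beta * (cost_current - cost_proposal))).

```python
import math
import random

DEGREES = (1, 2, 4)  # hypothetical candidate degrees of parallelism per operation


def simulate_cost(strategy, work, comm_per_task=0.5, repartition=0.25):
    """Toy stand-in for an execution simulator: estimated time per iteration.

    Compute time shrinks with the degree of parallelism, synchronization
    cost grows with it, and neighboring operations that use different
    degrees pay a repartitioning cost.
    """
    names = list(work)
    cost = 0.0
    for name in names:
        d = strategy[name]
        cost += work[name] / d + comm_per_task * (d - 1)
    for a, b in zip(names, names[1:]):
        if strategy[a] != strategy[b]:
            cost += repartition
    return cost


def mcmc_search(work, iterations=2000, beta=2.0, seed=0):
    """Metropolis search over per-operation degrees, guided by simulate_cost."""
    rng = random.Random(seed)
    # Start from a uniform strategy (every operation uses the largest degree).
    current = {name: DEGREES[-1] for name in work}
    current_cost = simulate_cost(current, work)
    best, best_cost = dict(current), current_cost
    for _ in range(iterations):
        # Propose: change the configuration of one randomly chosen operation.
        proposal = dict(current)
        proposal[rng.choice(list(work))] = rng.choice(DEGREES)
        proposal_cost = simulate_cost(proposal, work)
        delta = current_cost - proposal_cost
        # Always accept improvements; accept regressions with small probability.
        accept = 1.0 if delta >= 0 else math.exp(beta * delta)
        if rng.random() < accept:
            current, current_cost = proposal, proposal_cost
            if current_cost < best_cost:
                best, best_cost = dict(current), current_cost
    return best, best_cost


# Hypothetical three-operation model with relative compute costs.
WORK = {"conv1": 8.0, "conv2": 4.0, "fc": 1.0}
INITIAL_COST = simulate_cost({name: DEGREES[-1] for name in WORK}, WORK)
BEST, BEST_COST = mcmc_search(WORK)
```

Because each proposal changes a single operation, only part of the cost needs to be recomputed. That is the intuition behind the delta simulation mentioned above, although this toy version recomputes the full cost every time for simplicity.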

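Experimental Results

Research Questions

RQ1: Can the SOAP space yield faster parallelization than conventional data/model parallelism and expert-designed strategies?
RQ2: How accurate and how fast is FlexFlow's execution simulator compared with real execution?
RQ3: What throughput and scalability gains are achievable on real-world DNN benchmarks across GPU clusters?
RQ4: How does FlexFlow compare with REINFORCE and OptCNN at discovering efficient strategies?
RQ5: How does broader parallelization affect communication cost and scheduling?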

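Key Findings

FlexFlow increases training throughput by up to 3.8x over state-of-the-art approaches.
The simulator-guided search finds a strategy in 14–40 seconds on a single node with 4 GPUs, whereas REINFORCE needs 12–27 hours on up to 160 nodes.
In the evaluation, FlexFlow improves throughput by up to 3.3x and reduces communication cost by up to 5x.
On the same hardware configuration, FlexFlow is 3.4–3.8x faster than REINFORCE and 1.2–1.6x faster than OptCNN, the latter because it supports the broader SOAP space.
Across the measured executions, the simulator's relative error against real execution time stays below 30%, and it preserves the ordering of strategies by execution time.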
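
The last finding combines two separate properties: a bound on relative error and preservation of ordering. The sketch below shows one way to check both. The strategy names and timings are hypothetical placeholders and are not taken from the paper.

```python
def relative_error(simulated, measured):
    """Absolute difference between simulated and measured time, relative to measured."""
    return abs(simulated - measured) / measured


def preserves_order(simulated, measured):
    """True if ranking strategies by simulated time matches ranking by measured time."""
    by_simulated = sorted(simulated, key=simulated.get)
    by_measured = sorted(measured, key=measured.get)
    return by_simulated == by_measured


# Hypothetical per-iteration times in seconds (illustrative only).
SIMULATED = {"data_parallel": 1.10, "expert": 0.95, "searched": 0.62}
MEASURED = {"data_parallel": 1.30, "expert": 1.05, "searched": 0.70}

MAX_ERROR = max(relative_error(SIMULATED[k], MEASURED[k]) for k in MEASURED)
ORDER_OK = preserves_order(SIMULATED, MEASURED)
```

For a search procedure, ordering is arguably the more important property. An optimizer that only compares candidate strategies can still pick the faster one even when the absolute predictions are off by a noticeable margin.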


This review was generated by AI and reviewed by human editors.