QUICK REVIEW

[Paper Review] Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis

Tal Ben-Nun, Torsten Hoefler | arXiv (Cornell University) | Feb 26, 2018

Stochastic Gradient Optimization Techniques | References: 253 | Cited by: 212

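ONE-SENTENCE SUMMARY

A comprehensive survey of concurrency in deep learning, from the single operator up to distributed-scale training, covering concurrency models and the implications for parallelization strategies.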

ABSTRACT

Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge and techniques range from distributed algorithms to low-level circuit design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for its parallelization. We present trends in DNN architectures and the resulting implications on parallelization strategies. We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning. We discuss asynchronous stochastic optimization, distributed system architectures, communication schemes, and neural architecture search. Based on those approaches, we extrapolate potential directions for parallelism in deep learning.

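MOTIVATION AND OBJECTIVES

Define the terminology and foundational algorithms of parallel and distributed deep learning.
Analyze concurrency in DNN operators, network architectures, and training/inference workflows.
Evaluate the parallel computing architectures, communication schemes, and system implementations relevant to distributed deep learning.
Model concurrency with the Work-Depth framework and identify the trends that drive parallelization strategies.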

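PROPOSED APPROACH

Survey and classify DNN operators and their computational models.
Formulate the convolution, pooling, and normalization operators and their associated tensor data flows.
Discuss stochastic optimization and weight-update rules, including SGD with backpropagation and mini-batch SGD.
Apply the Work-Depth model to characterize parallelism and derive DAG-based computational bounds.
Analyze single-machine and multi-machine parallelism, including MPI and RDMA-based communication.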
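
The Work-Depth characterization listed above can be sketched in a few lines of Python. This is a minimal illustration, not the survey's own code: a computation is modeled as a DAG, the work W is the total cost of all nodes, and the depth D is the cost of the longest dependency path, so W/D estimates the average available parallelism. The function `work_and_depth`, the helpers `chain_sum` and `tree_sum`, and the unit cost per addition are all assumptions made for this example.

```python
from functools import lru_cache


def work_and_depth(costs, edges):
    """Return (W, D) for a DAG.

    costs: dict mapping node name -> cost of that node (hypothetical units)
    edges: list of (src, dst) pairs meaning dst depends on src
    """
    preds = {node: [] for node in costs}
    for src, dst in edges:
        preds[dst].append(src)

    @lru_cache(maxsize=None)
    def finish(node):
        # Cost of the longest dependency path that ends at `node`.
        return costs[node] + max((finish(p) for p in preds[node]), default=0)

    work = sum(costs.values())
    depth = max(finish(node) for node in costs)
    return work, depth


def chain_sum(n):
    """Sequential sum of n inputs: n - 1 adds, each waiting on the previous."""
    costs = {f"add{i}": 1 for i in range(n - 1)}
    edges = [(f"add{i}", f"add{i + 1}") for i in range(n - 2)]
    return costs, edges


def tree_sum(n):
    """Pairwise (tree) sum of n inputs, n assumed to be a power of two."""
    costs, edges = {}, []
    level = [None] * n  # None marks a raw input, which has no cost
    counter = 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            name = f"add{counter}"
            counter += 1
            costs[name] = 1
            for operand in (level[i], level[i + 1]):
                if operand is not None:
                    edges.append((operand, name))
            next_level.append(name)
        level = next_level
    return costs, edges


print(work_and_depth(*chain_sum(8)))  # (7, 7)
print(work_and_depth(*tree_sum(8)))   # (7, 3)
```

Both graphs perform 7 additions, but the tree has depth 3 against 7 for the chain, so its average parallelism W/D is roughly 2.3 instead of 1. The same bookkeeping applies when the nodes are layers or operators rather than scalar additions.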

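RESULTS

RESEARCH QUESTIONS

RQ1: How do DNN operators expose concurrency, and what are the implications for parallelism?
RQ2: What are the trade-offs among concurrency, accuracy, and hardware utilization in mini-batch SGD?
RQ3: How do distributed architectures and communication strategies affect scalable DNN training and inference?
RQ4: What role do parallel programming models and libraries (such as MPI, CUDA, and Spark) play in distributed deep learning?
RQ5: What are the future directions for achieving greater parallelism in deep learning workloads?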

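KEY FINDINGS

GPU-accelerated nodes dominate DL research, and distributed-memory systems are increasingly important for large-scale training.
Allreduce and other collective communication patterns are the core bottleneck in distributed DL and benefit from optimized HPC techniques.
Mini-batch size strikes a key balance between statistical generalization and hardware utilization; theoretical and empirical evidence guides warm-up, learning-rate scheduling, and variance control.
Convolution, pooling, and normalization operators are the main computational kernels driving DL workloads and their parallelization strategies.
DNN training and inference can be mapped to directed acyclic graphs (DAGs) annotated with work and depth, which enables parallelism analysis through the Work-Depth model.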
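
To make the allreduce finding concrete, the following is a single-process simulation of one common variant, ring allreduce, in plain Python. It is a sketch under stated assumptions rather than the implementation used by any particular framework: the summary does not say which allreduce algorithm is meant, the function name `ring_allreduce` is hypothetical, workers are simulated as list entries, and the vector length is assumed to be divisible by the number of workers. In data-parallel training the input vectors would be the per-worker gradients.

```python
def ring_allreduce(vectors):
    """Simulate a sum-allreduce over a ring of len(vectors) workers.

    vectors: one equal-length list of numbers per simulated worker.
    Returns one list per worker; every worker ends with the element-wise sum.
    """
    p = len(vectors)
    n = len(vectors[0])
    assert n % p == 0, "this sketch assumes the length is divisible by p"
    size = n // p

    # Each worker's buffer is split into p chunks.
    bufs = [[list(v[c * size:(c + 1) * size]) for c in range(p)]
            for v in vectors]

    # Phase 1, reduce-scatter: after p - 1 steps, worker r holds the fully
    # reduced chunk (r + 1) % p.
    for step in range(p - 1):
        # Snapshot all outgoing messages first so sends are simultaneous.
        msgs = []
        for r in range(p):
            c = (r - step) % p
            msgs.append((c, list(bufs[r][c])))
        for r, (c, payload) in enumerate(msgs):
            dst = (r + 1) % p
            bufs[dst][c] = [x + y for x, y in zip(bufs[dst][c], payload)]

    # Phase 2, allgather: circulate the reduced chunks around the ring.
    for step in range(p - 1):
        msgs = []
        for r in range(p):
            c = (r + 1 - step) % p
            msgs.append((c, list(bufs[r][c])))
        for r, (c, payload) in enumerate(msgs):
            bufs[(r + 1) % p][c] = payload

    # Flatten each worker's chunks back into a single vector.
    return [[x for chunk in b for x in chunk] for b in bufs]


grads = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
print(ring_allreduce(grads)[0])  # [111, 222, 333, 444, 555, 666]
```

With P workers and vectors of n elements, each worker sends 2(P - 1) chunks of n/P elements, so per-worker traffic stays below 2n regardless of P. The price is 2(P - 1) sequential communication steps, which is why latency-sensitive settings often prefer tree-based variants.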

This review was generated by AI and reviewed by a human editor.