QUICK REVIEW

[Paper Review] Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters

Hao Zhang, Zeyu Zheng|arXiv (Cornell University)|Jun 10, 2017

Advanced Neural Network Applications|References: 29|Cited by: 197

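One-Sentence Summary

Poseidon is a layer-wise, wait-free, hybrid communication architecture for data-parallel distributed deep learning on GPUs that achieves near-linear scaling by overlapping computation with communication and choosing the cheapest communication method for each layer.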

ABSTRACT

Deep learning models can take weeks to train on a single GPU-equipped machine, necessitating scaling out DL training to a GPU-cluster. However, current distributed DL implementations can scale poorly due to substantial parameter synchronization over the network, because the high throughput of GPUs allows more data batches to be processed per unit time than CPUs, leading to more frequent network synchronization. We present Poseidon, an efficient communication architecture for distributed DL on GPUs. Poseidon exploits the layered model structures in DL programs to overlap communication and computation, reducing bursty network communication. Moreover, Poseidon uses a hybrid communication scheme that optimizes the number of bytes required to synchronize each layer, according to layer properties and the number of machines. We show that Poseidon is applicable to different DL frameworks by plugging Poseidon into Caffe and TensorFlow. We show that Poseidon enables Caffe and TensorFlow to achieve 15.5x speed-up on 16 single-GPU machines, even with limited bandwidth (10GbE) and the challenging VGG19-22K network for image classification. Moreover, Poseidon-enabled TensorFlow achieves 31.5x speed-up with 32 single-GPU machines on Inception-V3, a 50% improvement over the open-source TensorFlow (20x speed-up).
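
The speed-up figures quoted in the abstract can be restated as scaling efficiency, that is, the fraction of ideal linear speed-up actually achieved. The short sketch below does that arithmetic using only the numbers stated above; the function name and the dictionary labels are illustrative choices, not terms from the paper.

```python
def scaling_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal linear speed-up achieved (1.0 means perfectly linear)."""
    return speedup / num_gpus


# (speed-up, number of single-GPU machines), as quoted in the abstract.
reported = {
    "Poseidon, VGG19-22K, 16 machines": (15.5, 16),
    "Poseidon TensorFlow, Inception-V3, 32 machines": (31.5, 32),
    "Open-source TensorFlow, Inception-V3, 32 machines": (20.0, 32),
}

efficiencies = {
    name: scaling_efficiency(speedup, gpus)
    for name, (speedup, gpus) in reported.items()
}

for name, eff in efficiencies.items():
    print(f"{name}: {eff:.1%} of linear")

# Ratio of the two Inception-V3 speed-ups on 32 machines.
relative_gain = 31.5 / 20.0
print(f"Poseidon vs. open-source TensorFlow: {relative_gain:.3f}x")
```

Read this way, both Poseidon configurations stay above 96% of linear scaling, while the open-source TensorFlow baseline on Inception-V3 reaches 62.5%.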

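Motivation and Objectives

Motivates scalable distributed deep learning on GPU clusters, where parameter synchronization is both bursty and large in volume.
Proposes Poseidon, which exploits the layer-by-layer structure of DL models to overlap computation with communication.
Introduces a hybrid communication scheme that picks the cheapest synchronization method for each layer based on that layer's properties.
Demonstrates applicability across frameworks by integrating Poseidon into Caffe and TensorFlow.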

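Proposed Method

Decomposes DL training into per-layer computation and synchronization steps so that forward/backward propagation can overlap with communication.
Introduces Wait-free Backpropagation (WFBP), which schedules independent operations concurrently so that gradient synchronization overlaps with the backward computation of lower layers.
Proposes Hybrid Communication (HybComm), which selects the cheaper synchronization method for each layer, parameter server (PS) or sufficient factor broadcasting (SFB), based on layer properties and cluster configuration; the Adam-style strategy appears as a point of comparison.
Implements Poseidon as a three-component system (coordinator, KV store, client library) whose APIs manage communication scheduling and transfer.
Demonstrates integration with existing frameworks (Caffe and TensorFlow) with minimal code changes and near-linear scaling on up to 32 GPUs.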
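
The idea behind Wait-free Backpropagation can be illustrated with a small simulation. This is a minimal sketch, not Poseidon's implementation: `wfbp_backward`, `compute_grad`, and `sync_grad` are hypothetical names, gradients are stand-in floats, and a thread pool plays the role of the communication layer. What it shows is the scheduling pattern: each layer's gradient is handed off for synchronization as soon as it is computed, and the worker blocks only once at the end of the backward pass.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def wfbp_backward(num_layers, compute_grad, sync_grad, comm_threads=2):
    """Backward pass from the top layer down; each layer's sync starts
    immediately and runs while lower layers are still being computed."""
    futures = []
    with ThreadPoolExecutor(max_workers=comm_threads) as pool:
        for layer in reversed(range(num_layers)):
            grad = compute_grad(layer)  # computation on the main thread
            # Non-blocking hand-off: communication proceeds in the background.
            futures.append(pool.submit(sync_grad, layer, grad))
        # Wait only here, after all backward computation has been issued.
        results = [f.result() for f in futures]
    return dict(results)


# Hypothetical stand-ins for real gradient computation and network sync.
compute_order = []
synced_layers = []
log_lock = threading.Lock()


def compute_grad(layer):
    compute_order.append(layer)
    return float(layer + 1)  # placeholder "gradient"


def sync_grad(layer, grad):
    with log_lock:
        synced_layers.append(layer)
    return layer, grad  # a real system would return the aggregated gradient


synced = wfbp_backward(4, compute_grad, sync_grad)
print("backward order:", compute_order)
print("layers synchronized:", sorted(synced))
```

In a real framework the hand-off would be triggered from inside the backward pass of each layer, and the communication would run on separate network threads or streams rather than a Python thread pool.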

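Experimental Results

Research Questions

RQ1: How can DL training be reorganized to hide communication cost and reduce bursty network traffic on GPU clusters?
RQ2: Does a per-layer hybrid communication strategy improve throughput over plain PS or SFB across different bandwidths and model sizes?
RQ3: How close to linear throughput scaling does Poseidon get across multiple DL frameworks and large models?
RQ4: How does Poseidon affect convergence speed and overall training efficiency on representative CNNs and datasets?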

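Key Findings

Poseidon achieves near-linear throughput scaling on up to 32 Titan X GPUs across multiple models and frameworks.
On 32 nodes, Poseidon-enabled TensorFlow reaches a 31.5x speed-up on Inception-V3, a 50% improvement over open-source TensorFlow (20x).
On 16 machines with bandwidth limited to 10GbE, Poseidon scales better than PS-based parallelization on large models such as VGG19-22K.
By automatically choosing the best communication method for each layer, Poseidon relieves the network bottleneck and improves bandwidth utilization.
Compared with Adam's sufficient-factor (SF) strategy and CNTK's 1-bit quantization, Poseidon delivers higher throughput or more stable statistical performance.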
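
The per-layer choice of communication method can be sketched as a simple cost comparison. The expressions below are an assumption and should be checked against the paper's cost table: for a fully connected layer with an M x N weight matrix, batch size K, P1 workers, and P2 servers co-located with workers, the sketch takes the PS cost as 2MN(P1 + P2 - 2)/P2 and the SFB cost as 2K(P1 - 1)(M + N). The `Layer` class and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    kind: str  # "fc" or "conv"
    rows: int  # M
    cols: int  # N


def ps_cost(m: int, n: int, workers: int, servers: int) -> float:
    """Assumed per-node traffic when syncing the full M x N gradient via PS."""
    return 2 * m * n * (workers + servers - 2) / servers


def sfb_cost(m: int, n: int, batch: int, workers: int) -> float:
    """Assumed per-node traffic when broadcasting sufficient factors instead."""
    return 2 * batch * (workers - 1) * (m + n)


def best_scheme(layer: Layer, batch: int, workers: int, servers: int) -> str:
    # Sufficient factors are only considered for fully connected layers here.
    if layer.kind != "fc":
        return "PS"
    if sfb_cost(layer.rows, layer.cols, batch, workers) <= ps_cost(
        layer.rows, layer.cols, workers, servers
    ):
        return "SFB"
    return "PS"


# Illustrative layer shapes; batch size and cluster size are made-up values.
layers = [
    Layer("conv1", "conv", 64, 27),
    Layer("fc_large", "fc", 25088, 4096),
    Layer("fc_small", "fc", 10, 64),
]
choices = {l.name: best_scheme(l, batch=32, workers=8, servers=8) for l in layers}
print(choices)
```

Under these assumptions a large fully connected layer favors SFB, because its sufficient factors are far smaller than the full gradient matrix, while small layers and convolutional layers stay on the parameter server.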

This review was generated by AI and reviewed by human editors.