QUICK REVIEW

[Paper Review] Poseidon: A System Architecture for Efficient GPU-based Deep Learning on Multiple Machines

Hao Zhang, Zhiting Hu | arXiv (Cornell University) | Dec 19, 2015

Advanced Neural Network Applications | References: 23 | Cited by: 43

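One-Sentence Summary

Poseidon is a system architecture that enables efficient, scalable deep learning training across multiple GPU-equipped machines using only commodity Ethernet. By combining a three-level hybrid architecture, a distributed wait-free backpropagation (DWBP) algorithm, and a structure-aware communication protocol (SACP), Poseidon achieves up to 6x speedup on AlexNet and up to 4x speedup on GoogLeNet on 8 nodes, approaching linear speedup while preserving convergence and accuracy.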

ABSTRACT

Deep learning (DL) has achieved notable successes in many machine learning tasks. A number of frameworks have been developed to expedite the process of designing and training deep neural networks (DNNs), such as Caffe, Torch and Theano. Currently they can harness multiple GPUs on a single machine, but are unable to use GPUs that are distributed across multiple machines; as even average-sized DNNs can take days to train on a single GPU with 100s of GBs to TBs of data, distributed GPUs present a prime opportunity for scaling up DL. However, the limited bandwidth available on commodity Ethernet networks presents a bottleneck to distributed GPU training, and prevents its trivial realization. To investigate how to adapt existing frameworks to efficiently support distributed GPUs, we propose Poseidon, a scalable system architecture for distributed inter-machine communication in existing DL frameworks. We integrate Poseidon with Caffe and evaluate its performance at training DNNs for object recognition. Poseidon features three key contributions that accelerate DNN training on clusters: (1) a three-level hybrid architecture that allows Poseidon to support both CPU-only and GPU-equipped clusters, (2) a distributed wait-free backpropagation (DWBP) algorithm to improve GPU utilization and to balance communication, and (3) a structure-aware communication protocol (SACP) to minimize communication overheads. We empirically show that Poseidon converges to same objectives as a single machine, and achieves state-of-art training speedup across multiple models and well-established datasets using a commodity GPU cluster of 8 nodes (e.g. 4.5x speedup on AlexNet, 4x on GoogLeNet, 4x on CIFAR-10). On the much larger ImageNet22K dataset, Poseidon with 8 nodes achieves better speedup and competitive accuracy to recent CPU-based distributed systems such as Adam and Le et al., which use 10s to 1000s of nodes.

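Motivation and Goals

Enable efficient distributed training of deep neural networks across multiple GPU-equipped machines using only commodity Ethernet.
Overcome the communication bottleneck caused by limited inter-machine bandwidth in commodity clusters.
Add distributed GPU support to existing single-machine deep learning frameworks (such as Caffe) without a complete rewrite.
Achieve high GPU utilization and minimize communication overhead in multi-node GPU clusters.
Preserve convergence and accuracy when scaling training across machines with data parallelism.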

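Proposed Approach

Introduces a three-level hybrid architecture that supports both CPU-only and GPU-equipped clusters, allowing flexible deployment on commodity hardware.
Uses a distributed wait-free backpropagation (DWBP) algorithm that overlaps communication with computation, reducing idle time and improving GPU utilization.
Designs a structure-aware communication protocol (SACP) that minimizes communication overhead by organizing parameter synchronization according to network topology and layer structure.
Adopts the stale synchronous parallel (SSP) consistency model, which allows bounded delay in parameter updates to improve bandwidth utilization and reduce synchronization latency.
Integrates with existing deep learning frameworks (such as Caffe) by extending them with distributed communication and synchronization primitives.
Supports both bulk synchronous parallel (BSP) and SSP modes to balance convergence stability against training speed; SSP permits partial asynchrony to raise throughput.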
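
The overlap of communication and computation described for DWBP can be sketched as follows. This is a minimal illustration under stated assumptions, not Poseidon's code: `backprop_wait_free`, `compute_grad`, and `send_grad` are hypothetical names, a Python thread stands in for the communication path, and plain function calls stand in for GPU computation and network transfer.

```python
import queue
import threading


def backprop_wait_free(layer_names, compute_grad, send_grad):
    """Run a backward pass top-down, handing each layer's gradient to a
    background sender so that sending overlaps with later computation."""
    outbox = queue.Queue()
    sent = []

    def sender():
        while True:
            item = outbox.get()
            if item is None:  # sentinel: the backward pass has finished
                break
            name, grad = item
            send_grad(name, grad)  # stand-in for network synchronization
            sent.append(name)

    comm_thread = threading.Thread(target=sender)
    comm_thread.start()

    # The backward pass visits layers from the output back to the input.
    for name in reversed(layer_names):
        grad = compute_grad(name)  # stand-in for GPU computation
        outbox.put((name, grad))   # does not block; the loop moves on

    outbox.put(None)
    comm_thread.join()
    return sent


# Toy usage: the "gradient" of a layer is just the length of its name.
layers = ["conv1", "conv2", "fc1", "fc2"]
received = {}
order = backprop_wait_free(
    layers,
    compute_grad=lambda name: len(name),
    send_grad=lambda name, grad: received.update({name: grad}),
)
```

In this toy run the gradients leave in top-down order (`fc2` first, `conv1` last), so the upper layers can be synchronized while the lower layers are still being computed. A real system would also have to apply the synchronized parameters before the next forward pass, which this sketch leaves out.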

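Experimental Results

Research Questions

RQ1: How can existing deep learning frameworks be extended to make efficient use of GPUs distributed across multiple machines using only commodity Ethernet?
RQ2: Which system-level optimizations are needed to overcome the communication bottleneck in multi-GPU, multi-node deep learning clusters?
RQ3: Can a hybrid architecture support both CPU-only and GPU-equipped clusters while maintaining high performance and scalability?
RQ4: To what extent do wait-free backpropagation and structure-aware communication reduce training latency and improve GPU utilization?
RQ5: How does the SSP consistency model affect convergence, accuracy, and speedup for distributed deep learning on commodity hardware?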
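
RQ5 turns on the SSP consistency model. As a rough illustration of the bounded-staleness rule behind it, the sketch below tracks how many iterations each worker has completed and lets a worker start its next iteration only while it is at most `staleness` iterations ahead of the slowest worker. The class and method names are hypothetical and this is not Poseidon's implementation; a real system also ties this rule to which parameter updates a worker is guaranteed to see.

```python
class StalenessClock:
    """Bounded-staleness bookkeeping for a fixed set of workers."""

    def __init__(self, num_workers, staleness):
        self.done = [0] * num_workers  # completed iterations per worker
        self.staleness = staleness

    def can_start(self, worker):
        # A worker may begin its next iteration only if it is at most
        # `staleness` completed iterations ahead of the slowest worker.
        return self.done[worker] - min(self.done) <= self.staleness

    def finish(self, worker):
        # Record that `worker` completed one more iteration.
        self.done[worker] += 1


# Toy usage: worker 0 is fast, worker 1 is slow, staleness bound is 1.
clock = StalenessClock(num_workers=2, staleness=1)
clock.finish(0)
clock.finish(0)
blocked = not clock.can_start(0)  # worker 0 is now 2 ahead, so it waits
clock.finish(1)
resumed = clock.can_start(0)      # gap is back to 1, so it may proceed
```

With `staleness=0` the same rule degenerates to lockstep execution in the style of BSP, since no worker may run ahead of the slowest one.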

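Key Findings

On 8 GPU nodes, Poseidon achieves a 4.5x speedup on AlexNet and a 4x speedup on GoogLeNet; with SACP enabled, the AlexNet speedup rises further to 6x.
With DWBP and SACP enabled, Poseidon reduces the throughput loss when scaling to 8 nodes from 80% without optimizations to under 25%, approaching linear speedup.
On the large-scale ImageNet 22K dataset, Poseidon with 8 nodes reaches accuracy comparable to recent CPU-based systems (such as Adam and Le et al.) and a better speedup, despite using far fewer nodes.
On AlexNet with 4 nodes, the SSP consistency model improves throughput by up to 27% (the speedup rises from 3.0x to 3.8x), indicating reduced sensitivity to slow nodes.
Poseidon converges to the same loss as single-machine training, confirming that its distributed training preserves model accuracy and stability.
SACP adds a small computational overhead for reconstructing parameter matrices, but still yields a substantial net performance gain, most visibly on deep models such as AlexNet.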
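
The speedup figures above are easier to compare once they are expressed relative to ideal linear scaling. The short sketch below does that arithmetic using only the numbers reported in this summary; the helper names are illustrative.

```python
def scaling_efficiency(speedup, nodes):
    """Fraction of ideal linear speedup achieved on `nodes` machines."""
    return speedup / nodes


def relative_gain(before, after):
    """Relative improvement when a metric goes from `before` to `after`."""
    return (after - before) / before


# Speedups on 8 nodes as reported in the findings above.
reported = {"AlexNet": 4.5, "GoogLeNet": 4.0, "AlexNet + SACP": 6.0}
for name, speedup in reported.items():
    print(f"{name}: {scaling_efficiency(speedup, 8):.0%} of linear on 8 nodes")

# SSP on 4 nodes: the speedup rises from 3.0x to 3.8x.
print(f"SSP gain on 4 nodes: {relative_gain(3.0, 3.8):.1%}")
```

A 6x speedup on 8 nodes is 75% of linear, which lines up with the reported throughput loss of under 25%, and the move from 3.0x to 3.8x is a gain of roughly 27%.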

This review was generated by AI and checked by human editors.