QUICK REVIEW

[Paper Review] Horovod: fast and easy distributed deep learning in TensorFlow

Alexander Sergeev, Mike Del Balso|arXiv (Cornell University)|Feb 15, 2018

Advanced Neural Network Applications|References: 5|Cited by: 522

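One-sentence summary

Horovod is a distributed training library for TensorFlow built on ring-allreduce. It cuts the code changes needed to move from one GPU to many and improves scaling, reaching near-linear speedup in multi-GPU settings. It ships as a standalone Python package, uses NCCL for communication, and includes debugging and profiling tools.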

ABSTRACT

Training modern deep learning models requires large amounts of computation, often provided by GPUs. Scaling computation from one GPU to many can enable much faster training and research progress but entails two complications. First, the training library must support inter-GPU communication. Depending on the particular methods employed, this communication may entail anywhere from negligible to significant overhead. Second, the user must modify his or her training code to take advantage of inter-GPU communication. Depending on the training library's API, the modification required may be either significant or minimal. Existing methods for enabling multi-GPU training under the TensorFlow library entail non-negligible communication overhead and require users to heavily modify their model-building code, leading many researchers to avoid the whole mess and stick with slower single-GPU training. In this paper we introduce Horovod, an open source library that improves on both obstructions to scaling: it employs efficient inter-GPU communication via ring reduction and requires only a few lines of modification to user code, enabling faster, easier distributed training in TensorFlow. Horovod is available under the Apache 2.0 license at https://github.com/uber/horovod

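Research motivation and goals

Explains the need for scalable distributed TensorFlow training at Uber and identifies two main obstacles: inter-GPU communication overhead and the complexity of the required user-code changes.
Proposes a ring-allreduce-based approach that addresses both scalability and simplicity.
Describes Horovod's architecture, its integration with TensorFlow and Keras, and an API designed to minimize user modifications.
Presents practical tooling (Horovod Timeline) and an optimization (Tensor Fusion) that improve usability and performance.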

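Proposed method

Adopts Baidu's draft ring-allreduce implementation for TensorFlow and replaces its communication layer with NVIDIA NCCL, which is optimized for communication across GPUs and across machines.
Implements Horovod as a standalone Python package, decoupling it from any specific TensorFlow release.
Extends support to models that fit on a single server, potentially spread across multiple GPUs.
Introduces a broadcast initialization hook to ensure all workers start from a consistent state.
Provides a minimal API surface: users wrap their optimizer in hvd.DistributedOptimizer and broadcast variables from rank 0.
Integrates Horovod Timeline for profiling and debugging across nodes.
Develops Tensor Fusion, which fuses small tensors into a larger buffer before allreduce to improve throughput on TCP networks.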
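The ring-allreduce idea at the core of the method can be illustrated with a single-process simulation. This is a minimal sketch, not Horovod's or NCCL's implementation: the function name `ring_allreduce` is hypothetical, the workers are plain Python lists, and the vector length is assumed to be divisible by the number of workers.

```python
from typing import List


def ring_allreduce(worker_data: List[List[float]]) -> List[List[float]]:
    """Simulate a sum ring-allreduce over n workers in one process.

    Assumption (simplification): every worker holds a vector of the same
    length, and that length is divisible by the number of workers.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    if any(len(v) != length for v in worker_data):
        raise ValueError("all workers must hold vectors of equal length")
    if length % n != 0:
        raise ValueError("vector length must be divisible by worker count")
    chunk = length // n
    bufs = [list(v) for v in worker_data]  # work on copies

    def span(c: int) -> slice:
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, scatter-reduce: in step s, worker r sends chunk (r - s) % n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        # Snapshot all outgoing chunks first to model a simultaneous exchange.
        sends = [((r - step) % n, bufs[r][span((r - step) % n)])
                 for r in range(n)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            bufs[dst][span(c)] = [a + b
                                  for a, b in zip(bufs[dst][span(c)], payload)]

    # Now worker r holds the fully reduced chunk (r + 1) % n.
    # Phase 2, allgather: pass the finished chunks around the ring; the
    # receiver overwrites its partial chunk with the finished one.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, bufs[r][span((r + 1 - step) % n)])
                 for r in range(n)]
        for r, (c, payload) in enumerate(sends):
            bufs[(r + 1) % n][span(c)] = payload

    return bufs


if __name__ == "__main__":
    grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(ring_allreduce(grads)[0])  # every worker ends with the same sums
```

In this scheme each worker sends 2(n - 1) chunks of size length / n, so the data sent per worker stays close to twice the buffer size regardless of how many workers join the ring.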

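Experimental results

Research questions

RQ1: Can ring-allreduce-based communication deliver near-linear scaling for TensorFlow training in multi-GPU and multi-machine environments?
RQ2: How much code modification is needed to convert a single-GPU TensorFlow program into a distributed Horovod program?
RQ3: Which practical tools and optimizations (such as Tensor Fusion and Timeline) improve usability and performance in real workflows?
RQ4: What are Horovod's performance characteristics on TCP versus RDMA networks and across models with different parameter counts?
RQ5: How does Horovod differ from standard distributed TensorFlow in efficiency and resource utilization?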
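RQ1 and RQ5 both turn on scaling efficiency. A minimal sketch of the usual definition follows: measured multi-GPU throughput divided by the ideal of N times single-GPU throughput. The function name and the throughput numbers are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def scaling_efficiency(throughput_n: float,
                       throughput_1: float,
                       num_gpus: int) -> float:
    """Return measured throughput as a fraction of ideal linear scaling."""
    if num_gpus < 1 or throughput_1 <= 0:
        raise ValueError("need num_gpus >= 1 and positive single-GPU throughput")
    ideal = throughput_1 * num_gpus  # perfect linear scaling
    return throughput_n / ideal


if __name__ == "__main__":
    # Hypothetical numbers (images/sec), not taken from the paper.
    single_gpu = 100.0
    measured_128 = 11264.0
    print(f"{scaling_efficiency(measured_128, single_gpu, 128):.0%}")  # prints 88%
```

An efficiency of 1.0 means perfectly linear scaling; values below that reflect communication and synchronization overhead.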

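Key findings

Horovod scales substantially better than standard distributed TensorFlow, reaching up to 88% scaling efficiency in the benchmarks.
On multiple GPUs, training with Horovod can be close to twice as fast as with standard distributed TensorFlow.
RDMA networking yields a modest gain on some models (an additional 3-4%), and scaling efficiency can exceed 90% for some architectures.
Tensor Fusion reduces communication overhead by fusing small tensors into larger buffers before allreduce, improving performance by up to 65% on models with many small tensor operations.
Horovod reduces setup and integration work to a few lines of code changes, which eases adoption across teams.
Horovod Timeline provides browser-based, high-level profiling and debugging to support performance analysis.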
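The Tensor Fusion finding can be made concrete with a small simulation of the pack, reduce, unpack cycle. This is a sketch under stated assumptions, not Horovod's code: `plan_fusion` and `fused_allreduce` are hypothetical names, buffer capacity is counted in elements rather than bytes, all tensors are assumed to share one data type, and an element-wise sum stands in for the real allreduce.

```python
from typing import List, Tuple


def plan_fusion(sizes: List[int], capacity: int) -> List[List[int]]:
    """Greedily group consecutive tensor indices to fit the buffer capacity.

    A tensor larger than the capacity is placed in a group of its own.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    used = 0
    for i, size in enumerate(sizes):
        if current and used + size > capacity:
            groups.append(current)
            current, used = [], 0
        current.append(i)
        used += size
    if current:
        groups.append(current)
    return groups


def fused_allreduce(per_worker: List[List[List[float]]],
                    capacity: int) -> Tuple[List[List[float]], int]:
    """per_worker[w][t] is tensor t on worker w.

    Returns the summed tensors and the number of allreduce calls issued.
    """
    sizes = [len(t) for t in per_worker[0]]
    reduced: List[List[float]] = [[] for _ in sizes]
    calls = 0
    for group in plan_fusion(sizes, capacity):
        # Copy the selected tensors into one contiguous buffer per worker.
        buffers = [[x for i in group for x in worker[i]]
                   for worker in per_worker]
        # One allreduce on the fused buffer (element-wise sum stand-in).
        summed = [sum(vals) for vals in zip(*buffers)]
        calls += 1
        # Copy the results back out into per-tensor outputs.
        offset = 0
        for i in group:
            reduced[i] = summed[offset:offset + sizes[i]]
            offset += sizes[i]
    return reduced, calls


if __name__ == "__main__":
    workers = [[[1, 2], [3], [4, 5, 6]],
               [[10, 20], [30], [40, 50, 60]]]
    print(fused_allreduce(workers, capacity=1))    # one call per tensor
    print(fused_allreduce(workers, capacity=100))  # a single fused call
```

The reduced values are identical in both runs; only the number of allreduce calls changes, which is where the saving on many small tensors would come from.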


This review was generated by AI and reviewed by human editors.