QUICK REVIEW

[Paper Review] PyTorch Distributed: Experiences on Accelerating Data Parallel Training

Li Shen, Yanli Zhao | arXiv (Cornell University) | Jun 28, 2020

Software System Performance and Reliability | References: 26 | Cited by: 116

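One-Sentence Summary

The paper presents the design, implementation, and evaluation of PyTorch DistributedDataParallel (DDP), which accelerates data parallel training through gradient bucketing, overlapping computation with communication, and skipping gradient synchronization, and attains near-linear scalability on up to 256 GPUs.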

ABSTRACT

This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it non-trivial to optimize the distributed training efficiency. As of v1.5, PyTorch natively provides several techniques to accelerate distributed data parallel, including bucketing gradients, overlapping computation with communication, and skipping gradient synchronization. Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.
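The consistency argument in the abstract can be made concrete with a minimal pure-Python sketch. It is not PyTorch code: it assumes a hypothetical one-parameter least-squares model and equally sized data shards, and the names `local_gradient` and `all_reduce_mean` are illustrative only. The latter merely stands in for an AllReduce sum followed by division by the number of replicas.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def local_gradient(w: float, batch: List[Sample]) -> float:
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def all_reduce_mean(values: List[float]) -> float:
    """Stand-in for AllReduce(sum) followed by division by the world size."""
    return sum(values) / len(values)


# Hypothetical data: 8 samples split evenly across 4 simulated replicas.
data: List[Sample] = [(float(i), 3.0 * i + 1.0) for i in range(8)]
world_size = 4
shard_len = len(data) // world_size
shards = [data[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]

w = 0.5
replica_grads = [local_gradient(w, shard) for shard in shards]
synced_grad = all_reduce_mean(replica_grads)
full_batch_grad = local_gradient(w, data)

# With equal shard sizes, the averaged gradient matches the full-batch gradient.
assert abs(synced_grad - full_batch_grad) < 1e-9
```

Because every replica applies the same averaged gradient to the same starting parameters, the replicas stay identical after each step. The equality relies on the shards being the same size; with uneven shards a plain average would weight samples differently.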

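Motivation and Objectives

Present the design and implementation of PyTorch's distributed data parallel module (DDP).
Show that synchronizing gradients makes distributed training mathematically equivalent to local training.
Identify performance bottlenecks and propose optimization techniques that maximize training throughput.
Provide real-world insights and measurements from internal and external deployments.
Highlight practical considerations for industrial-scale distributed training and directions for future improvement.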

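Proposed Method

Presents DDP as an nn.Module that wraps the local model, so integration is non-intrusive.
Describes the gradient reduction technique, which uses autograd hooks and AllReduce-based gradient averaging.
Introduces gradient bucketing, which aggregates small gradients into larger buckets to make AllReduce more efficient.
Explains how overlapping computation with communication hides gradient reduction latency.
Discusses the no_sync context manager, which enables gradient accumulation across multiple iterations.
Details the collective communication backends (NCCL, Gloo, MPI) and the ProcessGroup abstraction used to route communication.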
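The bucketing idea can be illustrated with a small pure-Python sketch. This is a simplification under stated assumptions, not DDP's actual implementation: the function `assign_buckets`, the parameter names, and the byte sizes are all hypothetical, and the sketch ignores details such as device and data type. It packs gradients greedily in reverse parameter order, which approximates the order in which gradients become ready during the backward pass, so that each full bucket could be reduced while earlier layers are still computing.

```python
from typing import List, Tuple


def assign_buckets(
    param_bytes: List[Tuple[str, int]], bucket_cap_bytes: int
) -> List[List[str]]:
    """Greedily pack gradients into buckets, walking parameters in reverse order."""
    buckets: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for name, nbytes in reversed(param_bytes):
        # Close the current bucket if adding this gradient would exceed the cap.
        if current and current_bytes + nbytes > bucket_cap_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets


# Hypothetical model: (parameter name, gradient size in bytes), in forward order.
params = [
    ("layer1.weight", 4000), ("layer1.bias", 40),
    ("layer2.weight", 4000), ("layer2.bias", 40),
    ("layer3.weight", 400), ("layer3.bias", 40),
]

buckets = assign_buckets(params, bucket_cap_bytes=4500)
for i, names in enumerate(buckets):
    print(f"bucket {i}: {names}")
```

In this toy setup, six per-parameter reductions collapse into two bucket-level ones. In PyTorch itself the bucket size is exposed through the `bucket_cap_mb` argument of `DistributedDataParallel`; the right value depends on the model and the interconnect.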

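Experimental Results

Research Questions

RQ1: How does PyTorch DDP guarantee mathematical equivalence with local training without requiring changes to user code?
RQ2: Which optimizations (bucketing, overlap, no_sync) improve distributed data parallel training performance the most?
RQ3: How do the communication backends (NCCL, Gloo, MPI) affect scalability and throughput?
RQ4: What practical considerations and failure modes arise when deploying DDP at scale?
RQ5: Which runtime configurations (bucket size, process groups, unused-parameter handling) affect convergence and speed?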

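Key Findings

When configured appropriately, DDP attains near-linear scalability on up to 256 GPUs.
Gradient bucketing and overlapping computation with communication improve performance significantly, especially for models with many small parameters.
Skipping gradient synchronization (no_sync) reduces amortized communication overhead with almost no effect on convergence speed.
Communication is the dominant latency component, and bucket size has a significant effect on efficiency; a poorly chosen bucket size can cancel out the gains.
The NCCL and Gloo backends show different performance characteristics; bucket size and process group configuration are key to achieving the best throughput.
The evaluation supports DDP's broad adoption and significant impact in production workloads.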
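The amortization effect of skipping synchronization can be sketched in plain Python. This is an illustrative simulation with made-up gradient values, not a measurement and not PyTorch code; `train_with_accumulation` and `count_sync_rounds` are hypothetical helpers. Each simulated replica accumulates its local gradients and communicates only once every `sync_every` iterations, which is roughly what happens when backward passes run inside DDP's `no_sync` context manager and the next backward pass outside it triggers the reduction.

```python
from typing import List


def count_sync_rounds(num_iterations: int, sync_every: int) -> int:
    """Number of synchronizations when syncing once every `sync_every` iterations."""
    return num_iterations // sync_every


def train_with_accumulation(
    micro_grads: List[List[float]], sync_every: int
) -> List[float]:
    """Simulate replicas that accumulate local gradients and average them periodically.

    micro_grads[r][t] is the hypothetical local gradient of replica r at iteration t.
    Returns the synchronized gradient produced at each sync point.
    """
    world_size = len(micro_grads)
    num_iterations = len(micro_grads[0])
    accumulated = [0.0] * world_size
    applied: List[float] = []
    for t in range(num_iterations):
        for r in range(world_size):
            accumulated[r] += micro_grads[r][t]  # local accumulation, no communication
        if (t + 1) % sync_every == 0:
            # One simulated AllReduce over the accumulated gradients.
            applied.append(sum(accumulated) / world_size)
            accumulated = [0.0] * world_size
    return applied


# Hypothetical per-iteration gradients for 2 replicas over 4 iterations.
grads = [[1.0, 2.0, 3.0, 4.0],
         [3.0, 2.0, 1.0, 0.0]]

every_step = train_with_accumulation(grads, sync_every=1)
every_other = train_with_accumulation(grads, sync_every=2)
```

Syncing every second iteration halves the number of communication rounds, and each synchronized value equals the sum of the two per-step averages it replaces. The optimizer then takes fewer, larger steps, which is why the effect on convergence still has to be checked empirically.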


This review was generated by AI and reviewed by a human editor.