QUICK REVIEW

[Paper Review] PyTorch: An Imperative Style, High-Performance Deep Learning Library

Adam Paszke, Sam Gross|arXiv (Cornell University)|Dec 3, 2019

Parallel Computing and Optimization Techniques | Cited by 16,163

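Summary in Brief

PyTorch combines eager, imperative execution in Python with GPU acceleration and automatic differentiation, achieving competitive performance while remaining easy to use. It emphasizes a Pythonic, flexible model-building experience backed by strong interoperability and a high-performance core.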
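
As a minimal sketch of what eager, imperative execution means in practice, the snippet below runs an ordinary Python `while` loop whose trip count depends on tensor values, then differentiates through whichever path was taken. The function name `forward` and the specific numbers are illustrative assumptions, not taken from the paper.

```python
import torch

# Fixed inputs so the run is reproducible.
w = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])

def forward(x, w):
    """Hypothetical model: the loop count depends on the data."""
    h = x * w
    steps = 0
    # Plain Python control flow; each tensor op executes immediately.
    while h.norm() < 10.0 and steps < 5:
        h = h * 2.0
        steps += 1
    return h.sum(), steps

loss, steps = forward(x, w)
loss.backward()  # autograd replays the operations that actually ran

print(steps, w.grad)
```

Because the loop is regular Python, it can be stepped through with a standard debugger or inspected with `print`, which is the usability argument the summary makes.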

ABSTRACT

Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it provides an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs. In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect of PyTorch is a regular Python program under the full control of its user. We also explain how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance. We demonstrate the efficiency of individual subsystems, as well as the overall speed of PyTorch on several common benchmarks.

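Research Motivation and Goals

Show that imperative, dynamic execution in a deep learning framework can reach performance comparable to that of static-graph frameworks.
Explain the design principles that balance usability and speed in PyTorch.
Demonstrate interoperability with the Python ecosystem and extensibility to custom operations.
Describe the implementation choices that enable efficient CPU/GPU execution and automatic differentiation.
Provide an empirical comparison with other leading frameworks.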

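Proposed Method

Propose a Python-centric eager execution model in which models, data, and training loops are all ordinary Python programs.
Explain the separation of control flow (Python/C++) from data flow (tensors and operations), and asynchronous GPU execution via CUDA streams.
Introduce a custom CUDA memory allocator that reduces allocation overhead and fragmentation.
Detail the multiprocessing support, built on the extended torch.multiprocessing module, that enables data parallelism and sharing of CUDA tensors between processes.
Outline a reference-counting memory management strategy, integrated with Python's own reference counting, that frees memory at predictable points.
Evaluate asynchronous dataflow and memory management, and benchmark against other frameworks.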
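
The caching-allocator idea in the list above can be illustrated with a toy model in pure Python. This is a simplified sketch, not PyTorch's actual allocator: the class `ToyCachingAllocator`, its fields, and the integer block ids are hypothetical, and the counter `backend_allocs` merely stands in for expensive driver calls such as `cudaMalloc`. The 512-byte rounding follows the paper's description of the allocator.

```python
from collections import defaultdict

BLOCK = 512  # round-up granularity in bytes

class ToyCachingAllocator:
    """Toy model: freed blocks are cached by size and reused."""

    def __init__(self):
        self.free_blocks = defaultdict(list)  # rounded size -> cached block ids
        self.backend_allocs = 0               # stand-in for costly driver allocations
        self._next_id = 0

    @staticmethod
    def round_up(nbytes):
        # Round the request up to a multiple of BLOCK to limit fragmentation.
        return ((nbytes + BLOCK - 1) // BLOCK) * BLOCK

    def malloc(self, nbytes):
        size = self.round_up(nbytes)
        if self.free_blocks[size]:
            # Cache hit: reuse a previously freed block, no backend call.
            return size, self.free_blocks[size].pop()
        # Cache miss: fall back to the (simulated) expensive allocation.
        self.backend_allocs += 1
        self._next_id += 1
        return size, self._next_id

    def free(self, block):
        size, block_id = block
        # Keep the block for later reuse instead of returning it to the backend.
        self.free_blocks[size].append(block_id)

alloc = ToyCachingAllocator()
for _ in range(100):  # e.g. 100 training iterations with the same buffer size
    buf = alloc.malloc(1000)
    alloc.free(buf)

print(alloc.backend_allocs)
```

In this sketch only the first iteration reaches the simulated backend; every later request of the same rounded size is served from the cache, which is the effect the summary attributes to the allocator.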

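Experimental Results

Research Questions

RQ1: Can an imperative, define-by-run framework achieve performance comparable to static-graph frameworks on deep learning workloads?
RQ2: Which design choices make PyTorch both easy to use (Pythonic) and fast on GPUs?
RQ3: How does PyTorch integrate with the wider Python ecosystem and custom extensions while remaining efficient?
RQ4: Which key runtime mechanisms (memory management, CUDA streams, automatic differentiation) deliver performance without sacrificing usability?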

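Key Findings

Across a set of common benchmarks, PyTorch's performance is within 17% of the fastest competing framework.
Asynchronous GPU execution via CUDA streams overlaps Python control flow with GPU work, yielding high device utilization.
The custom caching allocator and reference-counting memory management reduce CUDA allocation overhead and enable efficient memory reuse.
Bidirectional interoperability with NumPy and DLPack allows zero-copy data sharing and eases integration with Python data science tools.
The multiprocessing extension speeds up data-parallel training by moving tensor data into shared memory for efficient inter-process communication.
The framework shows strong adoption signals in the research community, as reflected in mentions on arXiv.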
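
The zero-copy NumPy interoperability mentioned above can be checked with a short sketch using `torch.from_numpy()` and the `.numpy()` tensor method. The array contents and variable names here are illustrative assumptions, and the example covers CPU tensors only.

```python
import numpy as np
import torch

arr = np.zeros(4, dtype=np.float32)

t = torch.from_numpy(arr)  # tensor that shares the array's memory
t[0] = 7.0                 # write through the tensor...
arr[1] = 3.0               # ...and through the array

back = t.numpy()           # view the same buffer as a NumPy array again

print(arr, t)
print(np.shares_memory(arr, back))
```

Since both objects view one buffer, a write through either side is visible through the other, and no data is copied in either direction.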

This review was generated by AI and reviewed by a human editor.