QUICK REVIEW

[Paper Review] Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions

Nicolas Vasilache, Oleksandr Zinenko|arXiv (Cornell University)|Feb 13, 2018

Computational Physics and Python Applications | References: 61 | Cited by: 252

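One-Sentence Summary

The paper introduces Tensor Comprehensions (TC), a domain-specific language paired with a polyhedral JIT compiler that generates high-performance CUDA kernels for ML workloads, with cross-framework integration and autotuned operator fusion.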

ABSTRACT

Deep learning models with convolutional and recurrent networks are now ubiquitous and analyze massive amounts of audio, image, video, text and graph data, with applications in automatic translation, speech-to-text, scene understanding, ranking user preferences, ad placement, etc. Competing frameworks for building these networks such as TensorFlow, Chainer, CNTK, Torch/PyTorch, Caffe1/2, MXNet and Theano, explore different tradeoffs between usability and expressiveness, research or production orientation and supported hardware. They operate on a DAG of computational operators, wrapping high-performance libraries such as CUDNN (for NVIDIA GPUs) or NNPACK (for various CPUs), and automate memory allocation, synchronization, distribution. Custom operators are needed where the computation does not fit existing high-performance library calls, usually at a high engineering cost. This is frequently required when new operators are invented by researchers: such operators suffer a severe performance penalty, which limits the pace of innovation. Furthermore, even if there is an existing runtime call these frameworks can use, it often doesn't offer optimal performance for a user's particular network architecture and dataset, missing optimizations between operators as well as optimizations that can be done knowing the size and shape of data. Our contributions include (1) a language close to the mathematics of deep learning called Tensor Comprehensions, (2) a polyhedral Just-In-Time compiler to convert a mathematical description of a deep learning DAG into a CUDA kernel with delegated memory management and synchronization, also providing optimizations such as operator fusion and specialization for specific sizes, (3) a compilation cache populated by an autotuner. [Abstract cutoff]

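Motivation and Goals

Provide a concise language, close to the underlying mathematics, for expressing ML tensor computations.
Translate tensor comprehensions into optimized GPU code with memory management and scheduling handled by the compiler.
Achieve cross-framework integration by connecting TC to PyTorch and Caffe2.
Provide autotuning through a built-in autotuner that explores optimization opportunities.
Demonstrate competitive performance and practical integration with existing ML frameworks.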

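Proposed Method

Define Tensor Comprehensions (TC), a notation for tensor operations close to Einstein summation notation.
Develop a polyhedral Just-In-Time (JIT) compiler that converts TC into CUDA kernels with memory management and synchronization.
Build a customized polyhedral optimization flow that includes kernel fusion and specialization for specific sizes.
Implement an autotuning framework that relies on JIT compilation and a compilation cache.
Integrate TC with PyTorch and Caffe2 through an in-process interface and the ATen asynchronous tensor library.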
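
To make the notation in the first item concrete: in the TC paper's syntax, matrix multiplication is written along the lines of `C(m, n) +=! A(m, k) * B(k, n)`. Indices that appear only on the right-hand side (here `k`) are reduced over, and the `!` asks for the output to be initialized before accumulating. The pure-Python function below spells out those semantics as explicit loops. The name `matmul_reference` and the list-of-lists representation are assumptions made for illustration; this is a reference for what the expression means, not what the TC compiler emits.

```python
def matmul_reference(A, B):
    """Loop-level meaning of C(m, n) +=! A(m, k) * B(k, n).

    A is an M x K list of lists, B is a K x N list of lists.
    Hypothetical helper for illustration; not part of TC.
    """
    M, K = len(A), len(A[0])
    K_b, N = len(B), len(B[0])
    if K != K_b:
        raise ValueError("inner dimensions must match")

    # The '!' in '+=!': start from the neutral element of the reduction (0 for +).
    C = [[0.0] * N for _ in range(M)]

    # m and n appear on the left-hand side, so they index the output.
    # k appears only on the right-hand side, so it is summed over.
    for m in range(M):
        for n in range(N):
            for k in range(K):
                C[m][n] += A[m][k] * B[k][n]
    return C


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = matmul_reference(A, B)
```

The loop ranges are inferred from the tensor shapes rather than written out, which is what lets the one-line comprehension stay so compact.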

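Experimental Results

Research Questions

RQ1: Can TC express common and custom ML operators concisely and safely?
RQ2: Does the TC-based flow achieve performance competitive with vendor libraries on standard ML kernels?
RQ3: Can the polyhedral optimizer effectively fuse operators and optimize for the data sizes and shapes of deep learning models?
RQ4: How well does TC integrate with mainstream frameworks such as PyTorch and Caffe2, and does it offer a path to production?
RQ5: How do autotuning and JIT compilation affect hardware-specific performance gains?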

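Key Findings

On kernels relevant to ML workloads, the TC flow achieves up to 4x speedup over NVIDIA libraries.
End-to-end integration of TC with Caffe2 and PyTorch demonstrates practical, production-oriented applicability.
A specialized polyhedral compiler provides effective optimization for deep learning kernels with long dependences.
Autotuning and code caching make domain-specific optimization possible for non-standard sizes and layouts.
The framework supports safe in-place updates and simple, declarative definitions of common layers (e.g., SGEMM, conv2d, maxpool).
The system stays framework-agnostic at its core while integrating closely with production environments.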


This review was generated by AI and checked by human editors.