QUICK REVIEW

[Paper Review] Intel nGraph: An Intermediate Representation, Compiler, and Executor for Deep Learning

Scott Cyphers, Arjun K. Bansal | arXiv (Cornell University) | Jan 24, 2018

Parallel Computing and Optimization Techniques | References: 2 | Cited by: 105

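One-Sentence Summary

The paper presents Intel nGraph, an intermediate representation connected to frameworks through bridges, together with a compiler and executor stack, designed to optimize deep learning performance across frameworks and hardware backends.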

ABSTRACT

The Deep Learning (DL) community sees many novel topologies published each year. Achieving high performance on each new topology remains challenging, as each requires some level of manual effort. This issue is compounded by the proliferation of frameworks and hardware platforms. The current approach, which we call "direct optimization", requires deep changes within each framework to improve the training performance for each hardware backend (CPUs, GPUs, FPGAs, ASICs) and requires $\mathcal{O}(fp)$ effort; where $f$ is the number of frameworks and $p$ is the number of platforms. While optimized kernels for deep-learning primitives are provided via libraries like Intel Math Kernel Library for Deep Neural Networks (MKL-DNN), there are several compiler-inspired ways in which performance can be further optimized. Building on our experience creating neon (a fast deep learning library on GPUs), we developed Intel nGraph, a soon to be open-sourced C++ library to simplify the realization of optimized deep learning performance across frameworks and hardware platforms. Initially-supported frameworks include TensorFlow, MXNet, and Intel neon framework. Initial backends are Intel Architecture CPUs (CPU), the Intel(R) Nervana Neural Network Processor(R) (NNP), and NVIDIA GPUs. Currently supported compiler optimizations include efficient memory management and data layout abstraction. In this paper, we describe our overall architecture and its core components. In the future, we envision extending nGraph API support to a wider range of frameworks, hardware (including FPGAs and ASICs), and compiler optimizations (training versus inference optimizations, multi-node and multi-device scaling via efficient sub-graph partitioning, and HW-specific compounding of operations).
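
The effort argument in the abstract can be made concrete with a small count. The sketch below assumes, as a reading of the bridge/transformer split rather than a figure quoted from the abstract, that a shared IR needs one bridge per framework and one transformer per backend, so the work grows roughly as f + p instead of f * p. Both function names are hypothetical.

```python
def direct_optimization_effort(f: int, p: int) -> int:
    """Integrations needed when every framework is tuned for every platform."""
    return f * p


def shared_ir_effort(f: int, p: int) -> int:
    """Integrations needed with one bridge per framework and one
    transformer per platform (an assumption of this sketch)."""
    return f + p


# The abstract lists three initial frameworks and three initial backends.
f, p = 3, 3
print(direct_optimization_effort(f, p))  # 9
print(shared_ir_effort(f, p))            # 6
```

The gap widens as either count grows: with 5 frameworks and 6 platforms the two counts would be 30 and 11.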

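Research Motivation and Objectives

Motivate the need for a framework-independent and backend-independent path to accelerating deep learning workloads.
Describe the nGraph intermediate representation and its graph-based structure.
Explain the framework bridges that convert front-end computation graphs into the nGraph IR.
Outline the backend transformers and how they generate optimized code for CPUs, NNPs, and GPUs.
Discuss future directions for broadening coverage of frameworks, hardware, and optimizations.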

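Proposed Method

Define a framework-independent IR as a directed acyclic graph of stateless operation nodes, each with inputs, outputs, and attributes.
Describe the framework bridges that convert front-end computation graphs (e.g., TensorFlow, MXNet, neon) into the nGraph IR.
Explain the transformers that compile the IR for a specific backend and handle memory management, data layout, and kernel selection.
Detail the backend-specific transformers for CPUs (MKL-DNN), the NNP, and NVIDIA GPUs (cuDNN, LLVM/PTX).
Discuss support for collective and point-to-point communication within the graph, which transformers implement with MPI or an optimized method.
Propose interoperability with ONNX and, as future work, plans to extend support to more frameworks and hardware.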
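
To illustrate what "a directed acyclic graph of stateless operation nodes" can look like in practice, here is a minimal sketch. It is not the nGraph C++ API: the `Node` class, the `KERNELS` table, and the `execute` function are hypothetical names chosen for this example, and the executor is a plain reference interpreter rather than a compiling transformer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True, eq=False)  # eq=False keeps identity-based hashing
class Node:
    """A stateless operation: its result depends only on inputs and attributes."""
    op: str
    inputs: Tuple["Node", ...] = ()
    attrs: Tuple[Tuple[str, object], ...] = ()  # immutable (key, value) pairs


def topo_order(output: Node) -> List[Node]:
    """Return every node reachable from `output`, inputs before consumers."""
    order: List[Node] = []
    seen = set()

    def visit(node: Node) -> None:
        if node in seen:
            return
        seen.add(node)
        for parent in node.inputs:
            visit(parent)
        order.append(node)

    visit(output)
    return order


# Illustrative kernel table; a real backend would dispatch to optimized code.
KERNELS: Dict[str, Callable[..., float]] = {
    "Add": lambda a, b: a + b,
    "Multiply": lambda a, b: a * b,
}


def execute(output: Node, feeds: Dict[str, float]) -> float:
    """Evaluate the graph in topological order with the given parameter values."""
    values: Dict[Node, float] = {}
    for node in topo_order(output):
        attrs = dict(node.attrs)
        if node.op == "Parameter":
            values[node] = feeds[attrs["name"]]
        elif node.op == "Constant":
            values[node] = attrs["value"]
        else:
            values[node] = KERNELS[node.op](*(values[i] for i in node.inputs))
    return values[output]


# Build the graph for x * 2 + y and run it.
x = Node("Parameter", attrs=(("name", "x"),))
y = Node("Parameter", attrs=(("name", "y"),))
two = Node("Constant", attrs=(("value", 2.0),))
out = Node("Add", (Node("Multiply", (x, two)), y))
result = execute(out, {"x": 3.0, "y": 1.0})
print(result)  # 7.0
```

Because the nodes hold no mutable state, the same graph object can be evaluated repeatedly with different feeds, which is what lets a backend analyze and rewrite it ahead of execution.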

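Experimental Results

Research Questions

RQ1: How does a framework-independent IR enable optimized deep learning execution across multiple backends?
RQ2: What role do framework bridges play in converting front-end graphs into the nGraph IR?
RQ3: How do backend transformers optimize code generation for the CPU, NNP, and GPU backends?
RQ4: What are the potential directions for supporting training and multi-node or multi-device scaling?
RQ5: How does nGraph interoperate with evolving standards and with other compiler and IR efforts in deep learning?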

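Key Findings

nGraph provides an IR, reached through framework bridges, that lets backends execute the same computation on CPUs, NNPs, and GPUs.
Transformers generate backend-optimized code and integrate with libraries such as MKL-DNN and cuDNN to exploit hardware capabilities.
The IR is a directed acyclic graph of stateless operation nodes whose attributes and adaptable data layouts serve as the basis for optimization.
Through framework bridges that map front-end graphs to the IR, nGraph supports an end-to-end compile-and-execute workflow.
Future work envisions broader interoperability (e.g., ONNX) and extension to more frameworks and hardware.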
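
The bridge-then-transformer workflow summarized above can be sketched end to end. Everything here is an assumption made for illustration: the front-end graph format, the op-name mapping, and the `bridge` and `compile_for` functions are hypothetical, and the "backend" kernels are ordinary Python functions standing in for library calls such as those a CPU or GPU transformer would select.

```python
from typing import Callable, Dict, List

# Hypothetical front-end graph: a topologically ordered list of op records.
frontend_graph = [
    {"name": "x", "type": "placeholder"},
    {"name": "w", "type": "placeholder"},
    {"name": "xw", "type": "mul", "inputs": ["x", "w"]},
    {"name": "y", "type": "relu", "inputs": ["xw"]},
]

# Bridge table: framework op names mapped to IR op names (illustrative).
FRONTEND_TO_IR = {"placeholder": "Parameter", "mul": "Multiply", "relu": "Relu"}


def bridge(graph: List[dict]) -> List[dict]:
    """Translate a front-end graph into the framework-independent IR."""
    return [
        {"name": n["name"], "op": FRONTEND_TO_IR[n["type"]],
         "inputs": n.get("inputs", [])}
        for n in graph
    ]


# Portable fallbacks, plus per-backend overrides that stand in for
# vendor-library kernels.
GENERIC_KERNELS: Dict[str, Callable[..., float]] = {
    "Multiply": lambda a, b: a * b,
    "Relu": lambda a: max(a, 0.0),
}
BACKEND_KERNELS: Dict[str, Dict[str, Callable[..., float]]] = {
    "CPU": {"Relu": lambda a: a if a > 0 else 0.0},
    "GPU": {},
}


def compile_for(backend: str, ir: List[dict]) -> Callable[[Dict[str, float]], float]:
    """Pick a kernel for each op (backend-specific first), return a callable."""
    table = {**GENERIC_KERNELS, **BACKEND_KERNELS[backend]}

    def run(feeds: Dict[str, float]) -> float:
        values = dict(feeds)
        for node in ir:
            if node["op"] != "Parameter":
                args = (values[i] for i in node["inputs"])
                values[node["name"]] = table[node["op"]](*args)
        return values[ir[-1]["name"]]

    return run


ir = bridge(frontend_graph)
cpu_fn = compile_for("CPU", ir)
gpu_fn = compile_for("GPU", ir)
cpu_result = cpu_fn({"x": 2.0, "w": -3.0})
gpu_result = gpu_fn({"x": 2.0, "w": 4.0})
print(cpu_result, gpu_result)  # 0.0 8.0
```

The point of the split is that `bridge` knows nothing about backends and `compile_for` knows nothing about frameworks, so adding a new framework or a new backend touches only one side.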


This review was generated by AI and reviewed by human editors.