QUICK REVIEW

[Paper Review] Intel nGraph: An Intermediate Representation, Compiler, and Executor for Deep Learning

Scott Cyphers, Arjun K. Bansal|arXiv (Cornell University)|2018. 01. 24.

Parallel Computing and Optimization Techniques | References: 2 | Citations: 105

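One-Line Summary

This paper presents Intel nGraph, a framework-bridging intermediate representation (IR) together with a compiler and executor stack, aimed at optimizing deep learning performance across frameworks and hardware backends.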

ABSTRACT

The Deep Learning (DL) community sees many novel topologies published each year. Achieving high performance on each new topology remains challenging, as each requires some level of manual effort. This issue is compounded by the proliferation of frameworks and hardware platforms. The current approach, which we call "direct optimization", requires deep changes within each framework to improve the training performance for each hardware backend (CPUs, GPUs, FPGAs, ASICs) and requires $\mathcal{O}(fp)$ effort; where $f$ is the number of frameworks and $p$ is the number of platforms. While optimized kernels for deep-learning primitives are provided via libraries like Intel Math Kernel Library for Deep Neural Networks (MKL-DNN), there are several compiler-inspired ways in which performance can be further optimized. Building on our experience creating neon (a fast deep learning library on GPUs), we developed Intel nGraph, a soon to be open-sourced C++ library to simplify the realization of optimized deep learning performance across frameworks and hardware platforms. Initially-supported frameworks include TensorFlow, MXNet, and Intel neon framework. Initial backends are Intel Architecture CPUs (CPU), the Intel(R) Nervana Neural Network Processor(R) (NNP), and NVIDIA GPUs. Currently supported compiler optimizations include efficient memory management and data layout abstraction. In this paper, we describe our overall architecture and its core components. In the future, we envision extending nGraph API support to a wider range of frameworks, hardware (including FPGAs and ASICs), and compiler optimizations (training versus inference optimizations, multi-node and multi-device scaling via efficient sub-graph partitioning, and HW-specific compounding of operations).

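Motivation and Objectives

Motivates the need for a framework- and backend-independent path for accelerating deep learning workloads.
Describes the nGraph intermediate representation (IR) and its graph-based structure.
Describes the framework bridges that translate front-end graphs into the nGraph IR.
Outlines the transformer backends that generate optimized code for CPU, NNP, and GPU.
Discusses future directions for extending framework, hardware, and optimization coverage.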
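
The effort argument behind this motivation can be made concrete with a small count. The abstract states that direct optimization costs $\mathcal{O}(fp)$ effort for $f$ frameworks and $p$ platforms. The $f + p$ figure below is an illustrative assumption rather than a number taken from this review: it supposes that a shared IR needs one bridge per framework and one transformer per backend, each counted as one unit of work. Using the three initially supported frameworks (TensorFlow, MXNet, neon) and the three initial backends (CPU, NNP, GPU):

$$
\underbrace{f \cdot p}_{\text{direct optimization}} = 3 \cdot 3 = 9
\qquad \text{versus} \qquad
\underbrace{f + p}_{\text{bridges} + \text{transformers}} = 3 + 3 = 6
$$

Under the same assumption the gap widens as either count grows, since adding one framework costs $p$ new integrations in the direct approach but only one new bridge with a shared IR.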

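Proposed Method

Defines a framework-independent IR as a directed acyclic graph of stateless operation nodes, each with inputs, outputs, and attributes.
Describes framework bridges that convert front-end computation graphs (TensorFlow, MXNet, neon, and others) into the nGraph IR.
Describes transformers that compile the IR for a specific backend and provide memory management, layout handling, and kernel selection.
Details the backend-specific transformers for CPU (MKL-DNN), NNP, and NVIDIA GPU (cuDNN, LLVM/PTX).
Discusses support, through the transformers, for in-graph collective and point-to-point communication (MPI or optimized methods).
Proposes interoperability with ONNX and presents plans for broader framework and hardware support.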
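
The bridge, IR, and transformer roles listed above can be illustrated with a minimal Python sketch. This is not the nGraph API (nGraph itself is a C++ library): every name here (`Node`, `bridge`, `topo_order`, `Transformer`, the kernel tables, and the dictionary-based front-end format) is hypothetical and chosen only for illustration. The two kernel tables merely stand in for backend libraries such as MKL-DNN or cuDNN.

```python
import operator
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass(eq=False)  # identity-based hashing: each node is a distinct graph vertex
class Node:
    """A stateless op node: an op name, its input nodes, and static attributes."""
    op: str
    inputs: Tuple["Node", ...] = ()
    attrs: Dict[str, object] = field(default_factory=dict)


def bridge(frontend_graph: Dict[str, dict], output: str) -> Node:
    """Toy 'framework bridge': translate a name-keyed front-end graph into IR nodes."""
    built: Dict[str, Node] = {}

    def build(name: str) -> Node:
        if name not in built:
            spec = frontend_graph[name]
            inputs = tuple(build(i) for i in spec.get("inputs", []))
            built[name] = Node(spec["op"], inputs, dict(spec.get("attrs", {})))
        return built[name]

    return build(output)


def topo_order(root: Node) -> List[Node]:
    """Return the nodes reachable from root, with inputs before their consumers."""
    order: List[Node] = []
    seen = set()

    def visit(node: Node) -> None:
        if id(node) in seen:
            return
        seen.add(id(node))
        for inp in node.inputs:
            visit(inp)
        order.append(node)

    visit(root)
    return order


class Transformer:
    """Toy 'transformer': binds each IR op to a backend kernel and returns a callable."""

    def __init__(self, name: str, kernels: Dict[str, Callable]):
        self.name = name
        self.kernels = kernels

    def compile(self, root: Node) -> Callable:
        order = topo_order(root)
        for node in order:  # kernel selection: fail early on unsupported ops
            if node.op != "Parameter" and node.op not in self.kernels:
                raise NotImplementedError(f"{self.name}: no kernel for {node.op}")

        def run(**feeds):
            values = {}
            for node in order:
                if node.op == "Parameter":
                    values[id(node)] = feeds[node.attrs["name"]]
                else:
                    args = [values[id(i)] for i in node.inputs]
                    values[id(node)] = self.kernels[node.op](*args)
            return values[id(root)]

        return run


# Two interchangeable kernel tables standing in for different backends.
REFERENCE_KERNELS = {
    "Add": lambda a, b: [x + y for x, y in zip(a, b)],
    "Multiply": lambda a, b: [x * y for x, y in zip(a, b)],
    "Relu": lambda a: [max(0.0, x) for x in a],
}
ALT_KERNELS = {
    "Add": lambda a, b: list(map(operator.add, a, b)),
    "Multiply": lambda a, b: list(map(operator.mul, a, b)),
    "Relu": lambda a: [x if x > 0 else 0.0 for x in a],
}

# A hypothetical front-end graph computing y = relu(x * w + b) elementwise.
frontend = {
    "x": {"op": "Parameter", "attrs": {"name": "x"}},
    "w": {"op": "Parameter", "attrs": {"name": "w"}},
    "b": {"op": "Parameter", "attrs": {"name": "b"}},
    "xw": {"op": "Multiply", "inputs": ["x", "w"]},
    "sum": {"op": "Add", "inputs": ["xw", "b"]},
    "y": {"op": "Relu", "inputs": ["sum"]},
}

ir = bridge(frontend, "y")
run_ref = Transformer("reference", REFERENCE_KERNELS).compile(ir)
run_alt = Transformer("alt", ALT_KERNELS).compile(ir)
result = run_ref(x=[1.0, -2.0], w=[3.0, 4.0], b=[0.5, 0.5])
```

The point of the sketch is the separation of concerns: the front-end description is translated once into a backend-neutral graph, and each transformer then compiles that same graph against its own kernels. Real transformers also handle memory management and data layout, which this toy version omits.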

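Experimental Results

Research Questions

RQ1: How can a framework-independent IR enable optimized deep learning execution across multiple backends?
RQ2: What role do framework bridges play in translating front-end graphs into the nGraph IR?
RQ3: How do backend transformers optimize code generation for the CPU, NNP, and GPU backends?
RQ4: What are the potential extensions for training support and multi-node/multi-device scaling?
RQ5: How can interoperability be achieved with evolving deep learning standards and other compiler/IR efforts?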

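Key Findings

nGraph provides a framework-bridging IR that lets backends execute the same operations on CPU, NNP, and GPU.
Transformers integrate with libraries such as MKL-DNN and cuDNN to generate backend-optimized code that exploits hardware capabilities.
The IR is a directed acyclic graph of stateless operation nodes whose data layouts and attributes can be adapted for optimization.
nGraph supports an end-to-end compile-and-execute workflow through framework bridges that map front-end graphs onto the IR.
Future work envisions broader interoperability (for example, ONNX) and extension to additional frameworks and hardware.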

This review was generated by AI and reviewed by a human editor.