QUICK REVIEW

[Paper Review] TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

Tianqi Chen, Thierry Moreau|arXiv (Cornell University)|Feb 12, 2018

Parallel Computing and Optimization Techniques | 38 references | 140 citations

TL;DR

TVM is an end-to-end compiler that enables performance-portable deep learning workloads across CPUs, GPUs, and accelerators by combining graph-level and operator-level optimizations with an ML-based cost model for automatic scheduling.

ABSTRACT

There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms -- such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) -- requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that are competitive with state-of-the-art, hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as the FPGA-based generic deep learning accelerator. The system is open sourced and in production use inside several major companies.

Motivation & Objective

Identify optimization challenges for performance-portable DL workloads across diverse hardware back-ends.
Develop an end-to-end compilation stack that maps high-level DL programs to optimized low-level code.
Enable joint high- and low-level optimizations via a tensor expression language and schedule space.
Automate operator optimization with a learning-based cost model to explore large optimization spaces.
Demonstrate portability and performance across CPUs, GPUs, and FPGA-based accelerators.

Proposed method

Introduce a tensor expression language and transformation primitives to separate computation from hardware intrinsics.
Extend the Halide-inspired separation of compute and schedule to support GPUs and new accelerators.
Build an automated optimization framework guided by an ML-based cost model that predicts performance of lowered programs.
Incorporate tensorization to map computations to hardware intrinsics via declarative tensor intrinsics and a tensorize primitive.
Implement memory latency hiding and explicit memory scopes to optimize for diverse memory hierarchies.
Provide an end-to-end stack that compiles models from frameworks (TensorFlow, MXNet, PyTorch, etc.) to hardware-specific optimized code.
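
The compute/schedule separation listed above can be illustrated with a minimal sketch. It is plain Python rather than TVM's API, and every name in it (`matmul_compute`, `schedule_naive`, `schedule_tiled`, `run`) is hypothetical. The idea it tries to show is that the computation is declared once, while the loop order is chosen separately and can be swapped without changing the result.

```python
# Illustrative only: plain Python, not TVM's tensor expression language.

def matmul_compute(A, B, i, j, k):
    """The 'what': one multiply-accumulate contribution to C[i][j]."""
    return A[i][k] * B[k][j]

def schedule_naive(n, m, p):
    """The 'how', version 1: visit (i, j, k) in plain nested-loop order."""
    for i in range(n):
        for j in range(m):
            for k in range(p):
                yield i, j, k

def schedule_tiled(n, m, p, tile=2):
    """The 'how', version 2: visit the same index set tile by tile."""
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, p, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        for k in range(k0, min(k0 + tile, p)):
                            yield i, j, k

def run(A, B, schedule):
    """Execute the same compute rule under whichever schedule is supplied."""
    n, p, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i, j, k in schedule(n, m, p):
        C[i][j] += matmul_compute(A, B, i, j, k)
    return C

A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C_naive = run(A, B, schedule_naive)
C_tiled = run(A, B, schedule_tiled)
print(C_naive == C_tiled)
```

Both schedules cover exactly the same iteration space, so the outputs match. In a real compiler the choice between them would be driven by the memory hierarchy of the target, which is the part this sketch leaves out.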

Experimental results

Research questions

RQ1: Can TVM achieve portable, competitive performance across server GPUs, embedded GPUs/CPUs, and FPGA-based accelerators without vendor-specific operator libraries?
RQ2: How effective is ML-guided optimization at navigating the large operator/schedule search space compared to black-box auto-tuning?
RQ3: What are the gains from graph-level optimizations (e.g., fusion, data layout) combined with operator-level code generation on diverse back-ends?
RQ4: How do latency hiding and tensorization affect performance on specialized accelerators such as TPU-like devices and FPGAs?
RQ5: Can TVM support emerging workloads (e.g., depthwise convolution, low-precision ops) and new accelerators while maintaining performance portability?
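
The model-guided search that RQ2 refers to can be sketched roughly as follows. This is a toy under stated assumptions, not the paper's method: the schedule space, the `measure` cost function, and the nearest-neighbour `predict` model are all made up for illustration. The point is only the loop structure, in which a cheap predictor decides which configuration is worth the expensive hardware measurement next.

```python
import itertools

# Hypothetical schedule space: tile sizes for two loop axes.
SPACE = list(itertools.product([1, 2, 4, 8, 16, 32], repeat=2))

def measure(cfg):
    """Stand-in for compiling and timing a config on hardware (made-up cost)."""
    tx, ty = cfg
    return abs(tx - 8) + abs(ty - 4) + 0.1 * tx * ty

def predict(cfg, history):
    """Toy cost model: reuse the measured cost of the nearest measured config."""
    nearest = min(
        history,
        key=lambda h: (h[0] - cfg[0]) ** 2 + (h[1] - cfg[1]) ** 2,
    )
    return history[nearest]

def model_guided_search(seeds, budget):
    """Measure the seeds, then keep measuring the candidate the model ranks best."""
    history = {cfg: measure(cfg) for cfg in seeds}
    while len(history) < min(budget, len(SPACE)):
        candidates = [c for c in SPACE if c not in history]
        best_guess = min(candidates, key=lambda c: predict(c, history))
        history[best_guess] = measure(best_guess)  # the only expensive step
    best = min(history, key=history.get)
    return best, history

best, history = model_guided_search(seeds=[(1, 1), (32, 32)], budget=10)
print(best, history[best])
```

A black-box tuner would replace the `predict` ranking with random or exhaustive selection, paying a hardware measurement for every candidate it considers. This sketch is purely greedy; a practical tuner would also need some exploration so that it does not get stuck near its seeds.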

Key findings

TVM delivers portable performance across back-ends with speedups of 1.2× to 3.8× over hand-tuned libraries in existing frameworks.
Operator fusion and graph optimizations significantly reduce memory accesses, yielding substantial runtime improvements.
Latency hiding using virtual threading and explicit memory scopes increased compute utilization (example: for ResNet on the FPGA accelerator, peak compute utilization rose from 70% to 88%).
Tensorization decouples hardware intrinsics from schedules, enabling easy support for new accelerators and achieving up to 1.5× speedups on micro-kernels.
An ML-based cost model enables faster discovery of optimized operator implementations, outperforming black-box auto-tuning in both speed and quality of configurations.
The system is open-sourced and deployed in industry, demonstrating practical viability across CPUs, GPUs, and FPGA-based accelerators.
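
The memory-access argument behind the operator fusion finding can be made concrete with a small sketch. It is an idealized illustration, not code from TVM: the function names are hypothetical, and the access count simply tallies element reads and writes to array buffers, ignoring caches and registers.

```python
def scale_bias_relu_unfused(x, scale, bias):
    """Three separate operators, each materializing a full intermediate buffer."""
    n = len(x)
    t1 = [v * scale for v in x]        # pass 1: read x, write t1
    t2 = [v + bias for v in t1]        # pass 2: read t1, write t2
    out = [max(v, 0.0) for v in t2]    # pass 3: read t2, write out
    accesses = 3 * (n + n)             # n reads + n writes per pass
    return out, accesses

def scale_bias_relu_fused(x, scale, bias):
    """One fused operator: intermediates never leave the loop body."""
    n = len(x)
    out = [max(v * scale + bias, 0.0) for v in x]  # single pass: read x, write out
    accesses = n + n
    return out, accesses

x = [-2.0, -0.5, 0.0, 1.5]
out_unfused, acc_unfused = scale_bias_relu_unfused(x, scale=2.0, bias=1.0)
out_fused, acc_fused = scale_bias_relu_fused(x, scale=2.0, bias=1.0)
print(out_unfused == out_fused, acc_unfused, acc_fused)
```

Under this simple counting, fusing three elementwise operators cuts buffer traffic by a factor of three while producing the same output. Real gains depend on the operators involved and on the target's memory hierarchy.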

This review was created by AI and reviewed by human editors.