[Paper Review] TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
TVM is an end-to-end compiler that enables performance-portable deep learning workloads across CPUs, GPUs, and accelerators by combining graph-level and operator-level optimizations with an ML-based cost model for automatic scheduling.
There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms -- such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) -- requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that are competitive with state-of-the-art, hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as the FPGA-based generic deep learning accelerator. The system is open sourced and in production use inside several major companies.
Motivation & Objective
- Identify optimization challenges for performance-portable DL workloads across diverse hardware back-ends.
- Develop an end-to-end compilation stack that maps high-level DL programs to optimized low-level code.
- Enable joint high- and low-level optimizations via a tensor expression language and schedule space.
- Automate operator optimization with a learning-based cost model to explore large optimization spaces.
- Demonstrate portability and performance across CPUs, GPUs, and FPGA-based accelerators.
Proposed method
- Introduce a tensor expression language and transformation primitives to separate computation from hardware intrinsics.
- Extend Halide's compute/schedule separation with new schedule primitives that target GPUs and specialized accelerators.
- Build an automated optimization framework guided by an ML-based cost model that predicts performance of lowered programs.
- Incorporate tensorization to map computations to hardware intrinsics via declarative tensor intrinsics and a tensorize primitive.
- Implement memory latency hiding and explicit memory scopes to optimize for diverse memory hierarchies.
- Provide an end-to-end stack that compiles models from frameworks (TensorFlow, MXNet, PyTorch, etc.) to hardware-specific optimized code.
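The compute/schedule separation listed above can be illustrated with a small sketch. This is plain Python, not TVM's API: `matmul_reference` and `matmul_tiled` are hypothetical names chosen for illustration, and the `tile` parameter is an assumed stand-in for a loop-splitting schedule primitive. The point is only that one computation can be executed under different loop structures without changing its result.

```python
def matmul_reference(A, B):
    """Compute definition: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def matmul_tiled(A, B, tile=2):
    """Same computation under a different 'schedule': every loop is split
    into an outer loop over tiles and an inner loop within a tile.
    (Illustrative only; a real compiler would generate this loop nest.)"""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i0 in range(0, n, tile):            # outer loops walk over tiles
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                for i in range(i0, min(i0 + tile, n)):   # inner loops stay
                    for j in range(j0, min(j0 + tile, p)):  # inside one tile
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tile, m)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
assert matmul_tiled(A, B, tile=2) == matmul_reference(A, B)
```

Because the tiled version only reorders the iteration space, the choice of `tile` affects memory access patterns but not the output, which is what makes the schedule a separate, searchable dimension.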
Experimental results
Research questions
- RQ1: Can TVM achieve portable, competitive performance across server GPUs, embedded GPUs/CPUs, and FPGA-based accelerators without vendor-specific operator libraries?
- RQ2: How effective is ML-guided optimization in navigating the large operator/schedule search space compared to black-box auto-tuning?
- RQ3: What are the gains from graph-level optimizations (e.g., fusion, data layout) combined with operator-level code generation on diverse back-ends?
- RQ4: How do latency hiding and tensorization impact performance on specialized accelerators like TPU-like devices and FPGAs?
- RQ5: Can TVM support emerging workloads (e.g., depthwise convolution, low-precision ops) and new accelerators while maintaining performance portability?
Key findings
- TVM delivers portable performance across back-ends, with speedups of 1.2× to 3.8× over existing frameworks backed by hand-optimized libraries.
- Operator fusion and graph optimizations significantly reduce memory accesses, yielding substantial runtime improvements.
- Latency hiding via virtual threading and explicit memory scopes raised compute utilization; for example, peak compute utilization for ResNet on the FPGA accelerator rose from 70% to 88%.
- Tensorization decouples hardware intrinsics from schedules, enabling easy support for new accelerators and achieving up to 1.5× speedups on micro-kernels.
- An ML-based cost model enables faster discovery of optimized operator implementations, outperforming black-box auto-tuning in both speed and quality of configurations.
- The system is open-sourced and deployed in industry, demonstrating practical viability across CPUs, GPUs, and FPGA-based accelerators.
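The idea of cost-model-guided search from the findings above can be sketched in a few lines. Everything here is an assumption made for illustration: `measure` is a synthetic stand-in for timing a candidate on real hardware, the search space is a hypothetical grid of two tile sizes, and the cost model is a toy 1-nearest-neighbor predictor swapped in for the learned models the paper describes. The loop also ranks all unmeasured candidates greedily rather than using a dedicated exploration strategy.

```python
import itertools
import math
import random

TILE_CHOICES = [1, 2, 4, 8, 16, 32]
SEARCH_SPACE = list(itertools.product(TILE_CHOICES, TILE_CHOICES))


def measure(config):
    """Hypothetical stand-in for a hardware measurement.
    Synthetic cost with its minimum at tile sizes (8, 4)."""
    ti, tj = config
    return 1.0 + abs(math.log2(ti) - 3) + 0.5 * abs(math.log2(tj) - 2)


def features(config):
    """Feature vector for a config: log2 of each tile size."""
    return tuple(math.log2(t) for t in config)


def predict(config, history):
    """Toy cost model: return the measured cost of the nearest measured config."""
    f = features(config)
    nearest = min(history, key=lambda c: math.dist(f, features(c)))
    return history[nearest]


def guided_search(budget, batch=4, seed=0):
    """Measure at most `budget` configs, picking each batch by predicted cost."""
    rng = random.Random(seed)
    limit = min(budget, len(SEARCH_SPACE))
    history = {}
    # Bootstrap the model with a few random measurements.
    for config in rng.sample(SEARCH_SPACE, min(batch, limit)):
        history[config] = measure(config)
    while len(history) < limit:
        # Rank unmeasured candidates by predicted cost, then measure the top few.
        candidates = [c for c in SEARCH_SPACE if c not in history]
        candidates.sort(key=lambda c: predict(c, history))
        for config in candidates[:min(batch, limit - len(history))]:
            history[config] = measure(config)
    best = min(history, key=history.get)
    return best, history


best, history = guided_search(budget=12)
print(best, history[best], len(history))
```

The design point the sketch tries to show is that real measurements are the expensive step, so the model is used to decide which few candidates are worth measuring next, and each new measurement feeds back into the model.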
This review was created by AI and reviewed by human editors.