QUICK REVIEW

[Paper Review] On-Device Neural Net Inference with Mobile GPUs

Ju Hyun Lee, Nikolay Chirkov|arXiv (Cornell University)|Jul 3, 2019

Advanced Neural Network Applications · 10 references · 58 citations

TL;DR

The paper presents a TensorFlow Lite GPU backend that enables real-time on-device neural network inference on mobile GPUs using OpenGL ES for Android and Metal for iOS, achieving 2–9× faster inference than CPU and detailing GPU-friendly network design and memory management strategies.

ABSTRACT

On-device inference of machine learning models for mobile phones is desirable due to its lower latency and increased privacy. Running such a compute-intensive task solely on the mobile CPU, however, can be difficult due to limited computing power, thermal constraints, and energy consumption. App developers and researchers have begun exploiting hardware accelerators to overcome these challenges. Recently, device manufacturers are adding neural processing units into high-end phones for on-device inference, but these account for only a small fraction of hand-held devices. In this paper, we present how we leverage the mobile GPU, a ubiquitous hardware accelerator on virtually every phone, to run inference of deep neural networks in real-time for both Android and iOS devices. By describing our architecture, we also discuss how to design networks that are mobile GPU-friendly. Our state-of-the-art mobile GPU inference engine is integrated into the open-source project TensorFlow Lite and publicly available at https://tensorflow.org/lite.

Motivation & Objective

Demonstrate real-time neural network inference on mobile GPUs across Android and iOS devices.
Integrate a GPU backend into TensorFlow Lite that targets OpenGL ES 3.1+ on Android and Metal on iOS 9+.
Propose GPU-friendly data layouts and shader-level optimizations to maximize throughput.

Proposed method

Describe the architecture of the TFLite GPU backend and its delegate-based graph partitioning.
Use Compute Shaders to implement neural network operators and fuse operations to reduce shader count.
Adopt the PHWC4 tensor layout to optimize memory access and cache utilization on mobile GPUs.
Implement a memory management strategy for intermediate tensors to minimize peak GPU memory, via Greedy and Minimum-Cost Flow approaches.
Tune work group sizes per device and operator type to balance compute and memory efficiency.
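To make the PHWC4 layout concrete, here is a minimal sketch in plain Python. It assumes the layout means: split the channel axis of an H×W×C tensor into slices of 4 channels, zero-pad the last slice, and store the slices one after another, each in H, W, 4 order. The function names `phwc4_index` and `hwc_to_phwc4` are hypothetical and are not part of TensorFlow Lite.

```python
def phwc4_index(h, w, c, height, width):
    """Flat offset of logical element (h, w, c) in a PHWC4 buffer (assumed layout)."""
    slice_idx, lane = divmod(c, 4)  # which 4-channel slice, and position inside it
    return ((slice_idx * height + h) * width + w) * 4 + lane


def hwc_to_phwc4(data, height, width, channels):
    """Repack a flat HWC list into a flat PHWC4 list, zero-padding channels to a multiple of 4."""
    slices = (channels + 3) // 4  # ceil(channels / 4)
    out = [0.0] * (slices * height * width * 4)
    for h in range(height):
        for w in range(width):
            for c in range(channels):
                src = (h * width + w) * channels + c
                out[phwc4_index(h, w, c, height, width)] = data[src]
    return out


# A 1x2 image with 5 channels needs two 4-channel slices (16 values in total).
packed = hwc_to_phwc4([float(v) for v in range(10)], height=1, width=2, channels=5)
```

In this sketch, the four channels of one pixel within a slice are always adjacent in memory, so a shader thread could fetch them as a single 4-component vector. The cost is the zero padding whenever the channel count is not a multiple of 4.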

Experimental results

Research questions

RQ1: Can a mobile GPU backend provide real-time or near-real-time inference on common mobile devices using TensorFlow Lite?
RQ2: What data layout and shader strategies optimize memory I/O and compute utilization on mobile GPUs?
RQ3: How can intermediate tensors be managed to minimize GPU memory footprint during on-device inference?
RQ4: What is the impact of the GPU backend on latency across representative networks and devices compared to CPU inference?

Key findings

The GPU backend achieved 2–9× average speedup over CPU inference across various networks.
PHWC4 memory layout reduces cache misses by aligning tensors to 4-channel groups and improves memory coalescing for GPU threads.
A GPU-specific optimization pipeline includes fusion of element-wise ops with heavier ops, inlining constants, and architecture-aware shader specialization.
An intermediate-tensor memory management strategy (Greedy or Minimum-Cost Flow) significantly reduces peak GPU memory; the memory footprint reductions are reported in Table 3.
Optimal work group sizes vary by GPU; Adreno GPUs show substantial gains from tuning, while Mali GPUs are more robust to changes; a practical table of recommended sizes is provided (Table 2).
TFLite GPU demonstrates reasonable coverage and performance across devices; performance differences are attributed in part to larger caches on iOS devices and to the choice between OpenGL and OpenCL backends.
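The idea behind sharing memory among intermediate tensors can be sketched as follows. This is a simplified greedy heuristic written for illustration, not the paper's exact Greedy or Minimum-Cost Flow algorithm. It assumes each tensor is described by a hypothetical tuple `(name, first_op, last_op, size)` and that a buffer can be reused once the tensor occupying it is no longer needed.

```python
def greedy_shared_objects(tensors):
    """Assign tensors to reusable buffers.

    tensors: list of (name, first_op, last_op, size) tuples (hypothetical format).
    Returns (assignment, object_sizes), where assignment maps name -> buffer index.
    """
    object_sizes = []  # current size of each shared buffer
    busy_until = []    # index of the last op that still reads each buffer
    assignment = {}
    for name, first, last, size in sorted(tensors, key=lambda t: t[1]):
        # Buffers whose previous tensor died before this tensor is produced.
        free = [i for i, end in enumerate(busy_until) if end < first]
        if free:
            # Reuse the free buffer closest in size, growing it if necessary.
            best = min(free, key=lambda i: abs(object_sizes[i] - size))
            object_sizes[best] = max(object_sizes[best], size)
        else:
            best = len(object_sizes)
            object_sizes.append(size)
            busy_until.append(last)
        busy_until[best] = last
        assignment[name] = best
    return assignment, object_sizes


# A four-op chain: each tensor is produced by one op and consumed by the next.
chain = [("t0", 0, 1, 100), ("t1", 1, 2, 200), ("t2", 2, 3, 100), ("t3", 3, 4, 50)]
assignment, sizes = greedy_shared_objects(chain)
naive_total = sum(t[3] for t in chain)  # one private buffer per tensor
shared_total = sum(sizes)               # buffers reused across tensors
```

For this toy chain, two buffers cover all four tensors, so the shared total is 300 units against 450 for the naive one-buffer-per-tensor allocation. Real networks with branches and skip connections have overlapping lifetimes, which is where the choice of strategy starts to matter.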

This review was created by AI and reviewed by human editors.