QUICK REVIEW

[Paper Review] On-Device Neural Net Inference with Mobile GPUs

Ju Hyun Lee, Nikolay Chirkov | arXiv (Cornell University) | Jul 3, 2019

Advanced Neural Network Applications | References: 10 | Cited by: 58

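One-Sentence Summary

This paper presents a TensorFlow Lite GPU backend that runs real-time on-device neural network inference on mobile GPUs using OpenGL ES (Android) and Metal (iOS), achieving a 2–9× speedup over CPU, and details GPU-friendly network design and memory management strategies.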

ABSTRACT

On-device inference of machine learning models for mobile phones is desirable due to its lower latency and increased privacy. Running such a compute-intensive task solely on the mobile CPU, however, can be difficult due to limited computing power, thermal constraints, and energy consumption. App developers and researchers have begun exploiting hardware accelerators to overcome these challenges. Recently, device manufacturers are adding neural processing units into high-end phones for on-device inference, but these account for only a small fraction of hand-held devices. In this paper, we present how we leverage the mobile GPU, a ubiquitous hardware accelerator on virtually every phone, to run inference of deep neural networks in real-time for both Android and iOS devices. By describing our architecture, we also discuss how to design networks that are mobile GPU-friendly. Our state-of-the-art mobile GPU inference engine is integrated into the open-source project TensorFlow Lite and publicly available at https://tensorflow.org/lite.

Research Motivation and Goals

Demonstrate real-time neural network inference on Android and iOS devices.
Integrate a GPU backend supporting OpenGL ES 3.1+ (Android) and Metal (iOS 9+) into TensorFlow Lite so that it works across devices.
Propose GPU-friendly data layouts and shader-level optimizations to maximize throughput.
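
A backend that has to work across devices needs a fallback for operators it cannot run. TensorFlow Lite handles this through delegates: the delegate claims the parts of the graph it supports and the rest stays on the CPU. The following is a minimal sketch of that idea, assuming a linear chain of operators rather than a general graph. The `GPU_SUPPORTED` set and the `partition` helper are hypothetical names for illustration, not TensorFlow Lite APIs.

```python
from typing import List, Tuple

# Hypothetical set of operator names a GPU backend can run (illustrative only).
GPU_SUPPORTED = {"CONV_2D", "DEPTHWISE_CONV_2D", "ADD", "RELU", "SOFTMAX"}


def partition(ops: List[str]) -> List[Tuple[str, List[str]]]:
    """Split an ordered op list into maximal consecutive runs tagged "GPU" or "CPU"."""
    partitions: List[Tuple[str, List[str]]] = []
    for op in ops:
        target = "GPU" if op in GPU_SUPPORTED else "CPU"
        if partitions and partitions[-1][0] == target:
            partitions[-1][1].append(op)        # extend the current run
        else:
            partitions.append((target, [op]))   # start a new run
    return partitions


# "CUSTOM_OP" is a made-up unsupported operator that forces a CPU fallback.
model = ["CONV_2D", "RELU", "CUSTOM_OP", "CONV_2D", "ADD", "SOFTMAX"]
for target, run in partition(model):
    print(target, run)
```

Each switch between GPU and CPU runs implies a data transfer between the two, so a model with few unsupported operators in the middle of the graph tends to benefit most from a GPU backend.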

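Proposed Method

Describes the architecture of the TFLite GPU backend and its delegate-based graph partitioning.
Implements neural network operators as compute shaders and fuses operations to reduce the number of shaders.
Adopts the PHWC4 tensor layout to optimize memory access and cache utilization on mobile GPUs.
Implements memory management strategies for intermediate tensors, using greedy and minimum-cost-flow approaches to minimize peak GPU memory.
Tunes work group sizes per device and operator type to balance compute and memory efficiency.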
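
To make the PHWC4 layout concrete, here is a small pure-Python sketch that repacks a flat HWC buffer into 4-channel planes. It assumes the physical shape is [ceil(C/4), H, W, 4] and that channels are zero-padded to a multiple of 4; the function names are illustrative and not taken from the paper or from TensorFlow Lite.

```python
from typing import List


def phwc4_offset(h: int, w: int, c: int, height: int, width: int) -> int:
    """Flat offset of logical element (h, w, c) in a [ceil(C/4), H, W, 4] buffer."""
    plane, lane = divmod(c, 4)  # which 4-channel plane, and position inside it
    return ((plane * height + h) * width + w) * 4 + lane


def hwc_to_phwc4(data: List[float], height: int, width: int, channels: int) -> List[float]:
    """Repack a flat HWC buffer into PHWC4, zero-padding channels to a multiple of 4."""
    planes = (channels + 3) // 4
    out = [0.0] * (planes * height * width * 4)
    for h in range(height):
        for w in range(width):
            for c in range(channels):
                src = (h * width + w) * channels + c
                out[phwc4_offset(h, w, c, height, width)] = data[src]
    return out


# A 1x2 image with 5 channels, values 0..9 in HWC order.
packed = hwc_to_phwc4([float(i) for i in range(10)], height=1, width=2, channels=5)
print(packed)
```

In this layout each pixel's four channels of a plane sit next to each other, so a shader thread can fetch them as a single 4-element vector, which is the access pattern the layout is meant to serve.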

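Experimental Results

Research Questions

RQ1: Can the mobile GPU backend deliver real-time or near-real-time inference with TensorFlow Lite on common mobile devices?
RQ2: Which data layouts and shader strategies optimize memory I/O and compute utilization on mobile GPUs?
RQ3: How should intermediate tensors be managed during on-device inference to minimize GPU memory footprint?
RQ4: How does the GPU backend affect latency relative to CPU inference across representative networks and devices?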

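Key Findings

The GPU backend achieves an average speedup of 2–9× over CPU inference across a variety of networks.
The PHWC4 memory layout aligns tensors to 4-channel groups, which reduces cache misses and improves memory coalescing across GPU threads.
A GPU-specific optimization pipeline fuses element-wise operations into heavier operations, inlines constants, and specializes shaders for the target architecture.
The intermediate-tensor memory management strategies (greedy or minimum-cost flow) significantly reduce peak GPU memory footprint (see the reductions reported in Table 3).
The optimal work group size varies by GPU: Adreno GPUs gain significantly from tuning, while Mali GPUs are more robust to changes. A table of practical recommended sizes is provided (Table 2).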
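TFLite GPU offers broad device coverage with reasonable performance; iOS devices benefit from larger caches, and on Android the OpenGL-based backend is compared against OpenCL-based backends.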
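
The memory-management finding above can be illustrated with a simplified greedy buffer-sharing sketch. It is not the paper's exact algorithm, and the minimum-cost-flow variant is not shown. It assumes each intermediate tensor has a known size and a lifetime given by the index of the operator that produces it and the index of the last operator that reads it; the `Tensor` class and `greedy_assign` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Tensor:
    name: str
    size: int   # bytes
    first: int  # index of the op that produces the tensor
    last: int   # index of the last op that reads it


def greedy_assign(tensors: List[Tensor]) -> Tuple[Dict[str, int], List[int]]:
    """Assign tensors to shared buffers, reusing a buffer once its tensor is dead.

    Returns (tensor name -> buffer id, size of each buffer).
    """
    sizes: List[int] = []       # current size of each shared buffer
    busy_until: List[int] = []  # last op index of the tensor occupying each buffer
    assignment: Dict[str, int] = {}
    for t in sorted(tensors, key=lambda x: x.first):
        # A buffer is free if its previous tensor was last read before t is produced.
        free = [i for i in range(len(sizes)) if busy_until[i] < t.first]
        if free:
            # Reuse the free buffer closest in size, growing it if needed.
            idx = min(free, key=lambda j: abs(sizes[j] - t.size))
            sizes[idx] = max(sizes[idx], t.size)
        else:
            sizes.append(t.size)
            busy_until.append(-1)
            idx = len(sizes) - 1
        busy_until[idx] = t.last
        assignment[t.name] = idx
    return assignment, sizes


# A four-op chain: each tensor is produced by op `first` and consumed by op `last`.
tensors = [
    Tensor("a", 100, 0, 1),
    Tensor("b", 60, 1, 2),
    Tensor("c", 100, 2, 3),
    Tensor("d", 40, 3, 4),
]
assignment, sizes = greedy_assign(tensors)
print(assignment, "total bytes:", sum(sizes), "naive bytes:", sum(t.size for t in tensors))
```

In this toy chain, two shared buffers totalling 160 bytes replace four separate allocations totalling 300 bytes, because a tensor's buffer can be handed to a later tensor as soon as its last reader has run.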


This review was generated by AI and reviewed by a human editor.