QUICK REVIEW

[Paper Digest] FastDeepIoT: Towards Understanding and Optimizing Neural Network Execution Time on Mobile and Embedded Devices

Shuochao Yao, Yiran Zhao|arXiv (Cornell University)|Sep 19, 2018

IoT and Edge/Fog Computing|References: 38|Cited by: 66

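One-Sentence Summary

FastDeepIoT identifies the nonlinear relationship between neural network structure and execution time on mobile and embedded devices, builds an interpretable execution time model, and uses it to steer compression, achieving significant speedups without loss of accuracy.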

ABSTRACT

Deep neural networks show great potential as solutions to many sensing application problems, but their excessive resource demand slows down execution time, posing a serious impediment to deployment on low-end devices. To address this challenge, recent literature focused on compressing neural network size to improve performance. We show that changing neural network size does not proportionally affect performance attributes of interest, such as execution time. Rather, extreme run-time nonlinearities exist over the network configuration space. Hence, we propose a novel framework, called FastDeepIoT, that uncovers the non-linear relation between neural network structure and execution time, then exploits that understanding to find network configurations that significantly improve the trade-off between execution time and accuracy on mobile and embedded devices. FastDeepIoT makes two key contributions. First, FastDeepIoT automatically learns an accurate and highly interpretable execution time model for deep neural networks on the target device. This is done without prior knowledge of either the hardware specifications or the detailed implementation of the used deep learning library. Second, FastDeepIoT informs a compression algorithm how to minimize execution time on the profiled device without impacting accuracy. We evaluate FastDeepIoT using three different sensing-related tasks on two mobile devices: Nexus 5 and Galaxy Nexus. FastDeepIoT further reduces the neural network execution time by $48\%$ to $78\%$ and energy consumption by $37\%$ to $69\%$ compared with the state-of-the-art compression algorithms.

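Research Motivation and Goals

Explain why neural network execution time on mobile and embedded devices is nonlinear in the network structure.
Develop an accurate, interpretable execution time model that does not depend on hardware or library internals.
Steer existing compression methods to minimize execution time while preserving accuracy.
Demonstrate on-device speedups and energy savings across sensing tasks and devices.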

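Proposed Method

The profiling module generates diverse network structures and records their execution times on the target device with the TensorFlow benchmark tool, building a time profiling dataset.
A tree-structured linear regression model partitions the structure configuration space into regions with linear timing behavior, using an explanatory variable vector that combines FLOPs, memory usage, and parameter size.
An execution time model Y = w^T x + b is learned with non-negativity constraints on the weights and bias, covering the FC, CNN, and RNN architecture types.
Two split conditions (range and integer multiple) guide the recursive partitioning, producing a tree that captures nonlinear timing effects.
The Compression Steering Module adds the timing model to the compression objective to minimize execution time, and includes a strategy that expands layers toward a local execution time minimum to further reduce run time.
Algorithm 1 describes building the tree-structured model; Algorithm 2 describes layer expansion and the local-minimum search during compression.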
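
The tree-structured model described above can be illustrated with a minimal sketch. This is not the paper's Algorithm 1: it assumes a single explanatory variable (a channel count) instead of the full FLOPs/memory/parameter vector, uses the sum of squared errors as the split criterion, and runs on synthetic timings. All function names and the data are hypothetical. It shows a non-negative linear fit per region and a search over the two kinds of split condition (range and integer multiple).

```python
def sse(points, w, b):
    """Sum of squared errors of y = w*x + b over (x, y) points."""
    return sum((y - (w * x + b)) ** 2 for x, y in points)

def fit_nonneg_line(points):
    """Least-squares fit of y = w*x + b subject to w >= 0 and b >= 0."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    candidates = [(0.0, max(my, 0.0))]          # best fit on the edge w = 0
    sx2 = sum(x * x for x, _ in points)
    if sx2 > 0:                                 # best fit on the edge b = 0
        candidates.append((max(sum(x * y for x, y in points) / sx2, 0.0), 0.0))
    if sxx > 0:                                 # unconstrained fit, if feasible
        w = sxy / sxx
        b = my - w * mx
        if w >= 0 and b >= 0:
            candidates.append((w, b))
    return min(candidates, key=lambda wb: sse(points, *wb))

def candidate_splits(points, multiples=(2, 4, 8)):
    """Yield (label, test) pairs for range and integer-multiple conditions."""
    xs = sorted({x for x, _ in points})
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        yield f"x <= {t}", (lambda x, t=t: x <= t)
    for k in multiples:
        yield f"x % {k} == 0", (lambda x, k=k: x % k == 0)

def build_tree(points, max_depth=2, min_leaf=3):
    """Recursively split while a split lowers the total squared error."""
    w, b = fit_nonneg_line(points)
    node = {"w": w, "b": b, "sse": sse(points, w, b)}
    if max_depth == 0:
        return node
    best = None
    for label, test in candidate_splits(points):
        yes = [p for p in points if test(p[0])]
        no = [p for p in points if not test(p[0])]
        if len(yes) < min_leaf or len(no) < min_leaf:
            continue
        total = sse(yes, *fit_nonneg_line(yes)) + sse(no, *fit_nonneg_line(no))
        if best is None or total < best[0]:
            best = (total, label, test, yes, no)
    if best is not None and best[0] < node["sse"] - 1e-9:
        _, label, test, yes, no = best
        node.update(split=label, test=test,
                    yes=build_tree(yes, max_depth - 1, min_leaf),
                    no=build_tree(no, max_depth - 1, min_leaf))
    return node

def predict(node, x):
    """Walk the tree to a leaf and apply its linear model."""
    while "test" in node:
        node = node["yes"] if node["test"](x) else node["no"]
    return node["w"] * x + node["b"]

# Synthetic timings (assumption): multiples of 4 avoid a fixed 8 ms penalty.
data = [(c, c + 2.0 if c % 4 == 0 else c + 10.0) for c in range(1, 33)]
tree = build_tree(data)
print(tree["split"], predict(tree, 16), predict(tree, 15))
```

On this synthetic data the root split lands on the integer-multiple condition, because it is the only candidate that leaves both regions exactly linear. Real profiling data would be noisier and multi-dimensional, so a proper non-negative least squares solver would replace the one-dimensional fit used here.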

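Experimental Results

Research Questions

RQ1: Beyond parameter count and FLOPs, which factors dominate neural network execution time on mobile and embedded devices?
RQ2: How can an execution time model that is both accurate and interpretable be learned automatically, without detailed hardware or library knowledge?
RQ3: How can execution time awareness be incorporated into existing compression methods to reduce latency and energy consumption without sacrificing accuracy?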

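Key Findings

Execution time modeling on Nexus 5 and Galaxy Nexus achieves a mean absolute percentage error (MAPE) of roughly 1%–7% across components, outperforming other regression methods.
Convolutional layers show strongly nonlinear execution time, with local minima where the channel count aligns with certain integer multiples (for example, multiples of 4).
The profile-based model shows that FLOPs and memory are significant predictors of execution time, while parameter size has limited correlation with run time.
Compared with state-of-the-art compression algorithms, FastDeepIoT further reduces execution time by 48%–78% and energy consumption by 37%–69%, without loss of accuracy.
Profiling and evaluation use two devices (Nexus 5 and Galaxy Nexus) with TensorFlow for Mobile, and profiling focuses on FC, CNN, GRU, and LSTM components.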
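
Two ideas from the findings above lend themselves to a short illustration: how MAPE is computed, and why expanding a layer to an aligned channel count can lower the predicted execution time. The sketch below is an assumption-laden toy, not the paper's Algorithm 2: `toy_conv_time` is a hypothetical timing model with a fixed penalty for channel counts that are not multiples of 4, and `expand_to_local_minimum` is a hypothetical helper that only searches a few channel counts at or above the compressed size.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

def toy_conv_time(channels):
    """Hypothetical timing model (ms): unaligned channel counts pay a penalty."""
    penalty = 0.0 if channels % 4 == 0 else 8.0
    return 1.0 * channels + 2.0 + penalty

def expand_to_local_minimum(channels, time_model, max_extra=3):
    """Pick the channel count in [channels, channels + max_extra]
    with the lowest predicted execution time."""
    candidates = range(channels, channels + max_extra + 1)
    return min(candidates, key=time_model)

compressed = 13                                   # channels kept by compression
expanded = expand_to_local_minimum(compressed, toy_conv_time)
print(expanded, toy_conv_time(compressed), toy_conv_time(expanded))
print(mape([100.0, 200.0], [95.0, 210.0]))
```

Under this toy model, a layer compressed to 13 channels is predicted to run faster after being expanded to 16, which is the kind of local minimum the expansion strategy looks for. The search never goes below the compressed size, so it only adds capacity to the layer.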

This digest was generated by AI and reviewed by a human editor.