QUICK REVIEW

[Paper Review] Memory-aware fusing and tiling of neural networks for accelerated edge inference

Jackson Farley|arXiv (Cornell University)|May 1, 2021

Advanced Neural Network Applications|References: 12|Cited by: 3

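One-Sentence Summary

This paper presents a memory-aware fusing and tiling technique for neural network inference on edge devices: by fusing and tiling two groups of convolutional layers independently, it runs YOLOv2 in less than half the memory and achieves a speedup of up to 2.78x under severe memory constraints. A memory usage predictor coupled with a search algorithm automatically returns a configuration whose latency is within 6% of the best latency measured in a manual search.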

ABSTRACT

A rising research challenge is running costly machine learning (ML) networks locally on resource-constrained edge devices. ML networks with large convolutional layers can easily exceed available memory, increasing latency due to excessive swapping. Previous memory reduction techniques such as pruning and quantization reduce model accuracy and often require retraining. Alternatively, distributed methods partition the convolutions into equivalent smaller sub-computations, but the implementations introduce communication costs and require a network of devices. However, a distributed partitioning approach can also be used to run in a reduced memory footprint on a single device by subdividing the network into smaller operations. This report extends prior work on distributed partitioning using tiling and fusing of convolutional layers into a memory-aware execution on a single device. Our approach extends prior fusing strategies to allow for two groups of convolutional layers that are fused and tiled independently. This approach reduces overhead via data reuse, and reduces the memory footprint further. We also propose a memory usage predictor coupled with a search algorithm to provide fusing and tiling configurations for an arbitrary set of convolutional layers. When applied to the YOLOv2 object detection network, results show that our approach can run in less than half the memory, and with a speedup of up to 2.78 under severe memory constraints. Additionally, our algorithm will return a configuration with a latency that is within 6% of the best latency measured in a manual search.

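Motivation and Objectives

Address the challenge of running large, memory-intensive neural networks on resource-constrained edge devices.
Reduce the memory footprint without sacrificing model accuracy, avoiding techniques such as pruning and quantization that lower accuracy and often require retraining.
Adapt the tiling and fusing techniques used for distributed partitioning to memory-aware execution on a single device.
Develop an automated configuration search that balances memory usage against inference latency.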

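Proposed Method

The method extends prior fusing strategies to allow two groups of convolutional layers that are fused and tiled independently.
Tiling splits large convolutions into smaller, equivalent sub-computations so that each one fits in the limited memory available on the device.
A memory usage predictor estimates the memory footprint of each fusing and tiling configuration to guide the search.
A search algorithm explores the configuration space for a good trade-off between memory usage and inference latency.
The method reuses data across tiled operations to reduce overhead from redundant memory accesses and computation.
The framework is applied to YOLOv2, demonstrating its effectiveness on a real-world object detection network.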
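The fusing and tiling idea above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: it assumes single-channel feature maps, stride-1 convolutions without padding, and a uniform tile grid, and the names `conv2d_valid` and `fused_tiled` are hypothetical. It shows that computing two fused layers one output tile at a time, from an input patch enlarged by the combined halo, reproduces the untiled result while only ever holding one patch in flight.

```python
import numpy as np

def conv2d_valid(x, w):
    """Naive stride-1 'valid' 2-D cross-correlation; x is (H, W), w is (k, k)."""
    k = w.shape[0]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for i in range(k):
        for j in range(k):
            out += w[i, j] * x[i:i + out_h, j:j + out_w]
    return out

def fused_tiled(x, w1, w2, tiles):
    """Compute conv2(conv1(x)) one output tile at a time on a tiles x tiles grid."""
    # Each fused layer widens the required input patch by (k - 1).
    halo = (w1.shape[0] - 1) + (w2.shape[0] - 1)
    out_h, out_w = x.shape[0] - halo, x.shape[1] - halo
    out = np.empty((out_h, out_w))
    row_edges = np.linspace(0, out_h, tiles + 1).astype(int)
    col_edges = np.linspace(0, out_w, tiles + 1).astype(int)
    for r0, r1 in zip(row_edges[:-1], row_edges[1:]):
        for c0, c1 in zip(col_edges[:-1], col_edges[1:]):
            # Input patch for this output tile, including the halo.
            patch = x[r0:r1 + halo, c0:c1 + halo]
            # The intermediate feature map exists only at patch size.
            out[r0:r1, c0:c1] = conv2d_valid(conv2d_valid(patch, w1), w2)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((20, 20))
w1 = rng.standard_normal((3, 3))
w2 = rng.standard_normal((3, 3))
full = conv2d_valid(conv2d_valid(x, w1), w2)
tiled = fused_tiled(x, w1, w2, tiles=2)
print("max abs difference:", np.abs(full - tiled).max())
```

Neighboring patches overlap by the halo width, so finer grids recompute more of the intermediate layer. That redundancy is the overhead that a configuration search has to weigh against the memory saved.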

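Experimental Results

Research Questions

RQ1: Does tiling and fusing two groups of convolutional layers independently reduce memory usage more effectively than tiling the network as a whole?
RQ2: How effective is the proposed memory predictor at guiding the search toward low-memory, high-performance configurations?
RQ3: How closely can the automated configuration search match the latency of manually tuned settings?
RQ4: Under severe memory constraints, what maximum memory reduction and speedup can the method achieve on a real model such as YOLOv2?
RQ5: Can the method avoid accuracy loss without fine-tuning or quantization?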

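Key Findings

Applied to YOLOv2, the method runs in less than half the memory of the original model.
Under severe memory constraints, it achieves a speedup of up to 2.78x over baseline inference.
The automated search algorithm returns a configuration whose latency is within 6% of the best latency measured in a manual search.
By reusing tiling and fusing strategies normally applied in distributed systems, the method enables efficient inference on a single device.
The memory predictor makes exploring the configuration space more efficient, avoiding extensive trial and error and requiring no retraining.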
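As a rough illustration of how a memory predictor can drive a configuration search, the sketch below uses a toy cost model rather than the paper's predictor. Everything in it is an assumption made for the example: the layer shapes in `LAYERS`, the float32 size, the peak-memory estimate (one input tile, one output tile, and one layer's weights), the use of multiply-accumulate counts as a latency proxy, and the names `group_stats` and `search`. It enumerates a split point and a tile grid for each of two fused groups and keeps the cheapest configuration that fits the budget.

```python
from itertools import product

BYTES = 4  # assumption: float32 activations and weights

# Hypothetical stride-1, 'same'-padded conv layers:
# (height, width, in_channels, out_channels, kernel). Not YOLOv2's real shapes.
LAYERS = [
    (64, 64, 3, 16, 3),
    (64, 64, 16, 32, 3),
    (64, 64, 32, 32, 3),
    (64, 64, 32, 64, 3),
]

def group_stats(layers, n):
    """Estimate (peak_bytes, total_macs) for one fused group tiled n x n."""
    # Start from the output tile of the group's last layer (ceil division).
    h = -(-layers[-1][0] // n)
    w = -(-layers[-1][1] // n)
    peak = macs = 0
    for H, W, cin, cout, k in reversed(layers):
        # The input tile grows by the halo, capped at the full feature map.
        in_h, in_w = min(h + k - 1, H), min(w + k - 1, W)
        tile_bytes = (in_h * in_w * cin + h * w * cout) * BYTES
        weight_bytes = k * k * cin * cout * BYTES
        peak = max(peak, tile_bytes + weight_bytes)
        macs += n * n * h * w * cout * cin * k * k  # includes halo recomputation
        h, w = in_h, in_w  # the previous layer must produce this tile
    return peak, macs

def search(layers, budget_bytes, tile_options=(1, 2, 3, 4, 5)):
    """Return (macs, peak_bytes, cut, n1, n2) with the fewest MACs within budget."""
    best = None
    for cut, n1, n2 in product(range(1, len(layers)), tile_options, tile_options):
        m1, c1 = group_stats(layers[:cut], n1)
        m2, c2 = group_stats(layers[cut:], n2)
        peak, cost = max(m1, m2), c1 + c2
        if peak <= budget_bytes and (best is None or cost < best[0]):
            best = (cost, peak, cut, n1, n2)
    return best

unconstrained = search(LAYERS, float("inf"))
constrained = search(LAYERS, unconstrained[1] // 2)
print("no budget:  ", unconstrained)
print("half budget:", constrained)
```

In this toy model, halving the budget forces the second group onto a finer tile grid, and the extra halo recomputation shows up as a higher MAC count. A real predictor would need to account for framework buffers and measured latency, which this sketch does not attempt.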


This review was generated by AI and reviewed by human editors.