QUICK REVIEW

[Paper Digest] YOLObile: Real-Time Object Detection on Mobile Devices via Compression-Compilation Co-Design

Yuxuan Cai, Hongjia Li | arXiv (Cornell University) | Sep 12, 2020

Advanced Neural Network Applications | References: 47 | Cited by: 26

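ONE-SENTENCE SUMMARY

YOLObile is a compression-compilation co-design framework for real-time object detection on mobile devices. It introduces block-punched pruning, which applies to any convolution kernel size, together with a GPU-CPU collaborative inference scheme. On a Samsung Galaxy S20 it compresses YOLOv4 by 14× at 49.0 mAP and runs at 19.1 FPS, about 5× faster than the original YOLOv4.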

ABSTRACT

The rapid development and wide utilization of object detection techniques have aroused attention on both accuracy and speed of object detectors. However, the current state-of-the-art object detection works are either accuracy-oriented using a large model but leading to high latency or speed-oriented using a lightweight model but sacrificing accuracy. In this work, we propose YOLObile framework, a real-time object detection on mobile devices via compression-compilation co-design. A novel block-punched pruning scheme is proposed for any kernel size. To improve computational efficiency on mobile devices, a GPU-CPU collaborative scheme is adopted along with advanced compiler-assisted optimizations. Experimental results indicate that our pruning scheme achieves 14× compression rate of YOLOv4 with 49.0 mAP. Under our YOLObile framework, we achieve 17 FPS inference speed using GPU on Samsung Galaxy S20. By incorporating our proposed GPU-CPU collaborative scheme, the inference speed is increased to 19.1 FPS, and outperforms the original YOLOv4 by 5× speedup. Source code is at: https://github.com/nightsnack/YOLObile

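MOTIVATION AND OBJECTIVES

Resolve the accuracy-versus-speed trade-off in mobile object detection by delivering inference that is both accurate and low-latency.
Overcome the limits of existing pruning methods, pattern-based pruning in particular, so that convolution kernels of any size can be pruned effectively.
Design a co-optimization framework that combines model compression with compiler optimizations to maximize hardware parallelism on mobile platforms.
Reach real-time inference (≥15 FPS) on consumer mobile devices without sacrificing detection accuracy.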

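PROPOSED METHOD

Block-punched pruning: the weights of each DNN layer are divided into equal-sized blocks, and weights are pruned uniformly within each block, which gives fine-grained sparsity for any convolution kernel size.
A GPU-CPU collaborative computation scheme exploits the heterogeneous hardware parallelism available on mobile devices.
Compiler optimizations, including compact storage, block reordering, and auto-tuning, improve computational efficiency.
Layer-wise compression rates are adjusted so that 3×3 CONV layers receive a higher pruning ratio, because they dominate the total computation.
The structured sparsity produced by block-punched pruning enables hardware acceleration while preserving model accuracy.
The block size is tuned to 8×4 to balance accuracy and inference speed on mobile GPU/CPU architectures.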
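
To make the block-punched idea more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `block_punched_prune`, the `keep_ratio` parameter, and the magnitude-based (L2 norm) choice of which kernel positions to keep are all hypothetical. The sketch only shows the structural property that every kernel inside a block of 8 filters × 4 channels is zeroed at the same positions, and it only covers kernels with more than one spatial position.

```python
import numpy as np

def block_punched_prune(weight, block_filters=8, block_channels=4, keep_ratio=0.25):
    """Zero the same kernel positions for every kernel inside each block.

    weight: array of shape (filters, channels, kernel_h, kernel_w).
    Returns (pruned_weight, boolean_mask).
    Assumption: positions are ranked by L2 norm over the block; the
    selection criterion used in the paper may differ.
    """
    f, c, kh, kw = weight.shape
    n_pos = kh * kw
    # Keep at least one position so a block is never emptied completely.
    n_keep = max(1, int(round(keep_ratio * n_pos)))
    mask = np.zeros(weight.shape, dtype=bool)
    for f0 in range(0, f, block_filters):
        for c0 in range(0, c, block_channels):
            block = weight[f0:f0 + block_filters, c0:c0 + block_channels]
            # One score per kernel position, shared by all kernels in the block.
            score = np.sqrt((block ** 2).sum(axis=(0, 1))).ravel()
            keep = np.argsort(score)[-n_keep:]
            pos_mask = np.zeros(n_pos, dtype=bool)
            pos_mask[keep] = True
            # Broadcast the (kh, kw) position mask over the whole block.
            mask[f0:f0 + block_filters, c0:c0 + block_channels] = pos_mask.reshape(kh, kw)
    return weight * mask, mask

# Usage: a 3x3 CONV layer with 16 filters and 8 input channels.
rng = np.random.default_rng(0)
w = rng.normal(size=(16, 8, 3, 3))
pruned, mask = block_punched_prune(w, keep_ratio=1 / 3)
print("fraction of weights kept:", mask.mean())
```

Because the zeroed positions are identical within a block, a compiler can skip them for the whole block at once, which is the kind of regularity the hardware-acceleration claim above relies on.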

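EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a pruning scheme be designed that supports any kernel size in convolutional layers while maintaining high accuracy and remaining amenable to hardware acceleration?
RQ2: How can GPU-CPU collaboration be coordinated effectively to maximize inference speed on mobile devices?
RQ3: What is the best trade-off among block size, pruning granularity, and inference performance on mobile hardware?
RQ4: Can compiler optimizations significantly improve the efficiency of pruned models on mobile platforms?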
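
One way to picture the coordination problem behind RQ2 is a toy latency model for two independent branches of a network. The sketch below is not the paper's scheduler: the `BranchCost` type, the `plan_two_branches` function, the transfer-cost parameter, and all numbers are hypothetical. It simply compares running both branches serially on the GPU against offloading one branch to the CPU so the two devices work in parallel.

```python
from dataclasses import dataclass

@dataclass
class BranchCost:
    gpu_ms: float  # hypothetical measured latency of the branch on the GPU
    cpu_ms: float  # hypothetical measured latency of the branch on the CPU

def plan_two_branches(a, b, transfer_ms):
    """Return (plan_name, estimated_latency_ms) for the cheapest plan.

    Assumption: the branch sent to the CPU pays a fixed transfer cost,
    and the two devices can run fully in parallel.
    """
    plans = {
        "both_on_gpu": a.gpu_ms + b.gpu_ms,  # serial on one device
        "a_gpu_b_cpu": max(a.gpu_ms, b.cpu_ms + transfer_ms),
        "a_cpu_b_gpu": max(a.cpu_ms + transfer_ms, b.gpu_ms),
    }
    best = min(plans, key=plans.get)
    return best, plans[best]

# Usage: a heavy branch and a light branch, with made-up latencies.
heavy = BranchCost(gpu_ms=10.0, cpu_ms=40.0)
light = BranchCost(gpu_ms=3.0, cpu_ms=6.0)
print(plan_two_branches(heavy, light, transfer_ms=1.0))
print(plan_two_branches(heavy, light, transfer_ms=20.0))
```

In this toy model offloading pays off only when the CPU finishes the light branch, transfer included, before the GPU finishes the heavy one; with a large transfer cost the serial GPU plan wins again.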

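Key Findings

YOLObile compresses the YOLOv4 model by 14× at a cost of 8.3 mAP points (from 57.3 to 49.0).
The framework reaches 19.1 FPS on a Samsung Galaxy S20, roughly 5× faster than the original YOLOv4 (3.5 FPS).
Block-punched pruning delivers 49.0 mAP at 14× compression and, at high compression rates, is both more accurate and faster than pattern-based pruning.
The GPU-CPU collaborative scheme raises inference speed from 17 FPS to 19.1 FPS, showing that the CPU contributes useful throughput alongside the GPU.
Ablation studies show that non-uniform pruning, which assigns a higher compression rate to 3×3 CONV layers, yields better accuracy and speed than uniform pruning.
The best block size is 8×4 (8 filters, 4 channels), which gives the best balance of accuracy and speed on mobile hardware.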
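
The headline figures can be cross-checked with a few lines of arithmetic. The snippet below uses only the numbers quoted in this digest (57.3 and 49.0 mAP; 3.5, 17, and 19.1 FPS); the variable names are illustrative.

```python
# Figures as quoted in this digest.
baseline_map, pruned_map = 57.3, 49.0   # YOLOv4 vs. YOLObile at 14x compression
baseline_fps = 3.5                      # original YOLOv4
gpu_fps, gpu_cpu_fps = 17.0, 19.1       # YOLObile: GPU only vs. GPU-CPU

map_drop_points = baseline_map - pruned_map           # absolute drop in mAP points
map_drop_relative = map_drop_points / baseline_map    # drop relative to the baseline
speedup_vs_yolov4 = gpu_cpu_fps / baseline_fps        # end-to-end speedup
gain_from_cpu = gpu_cpu_fps / gpu_fps - 1             # extra speed from adding the CPU

print(f"mAP drop: {map_drop_points:.1f} points ({map_drop_relative:.1%} relative)")
print(f"speedup over YOLOv4: {speedup_vs_yolov4:.2f}x")
print(f"gain from GPU-CPU collaboration: {gain_from_cpu:.1%}")
```

The quoted 5× speedup is therefore a rounded figure (19.1 / 3.5 is a little under 5.5), and the GPU-CPU scheme adds roughly 12% on top of GPU-only inference.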


This digest was generated by AI and reviewed by human editors.