QUICK REVIEW

[Paper Review] DeepRebirth: Accelerating Deep Neural Network Execution on Mobile Devices

Dawei Li, Xiaolong Wang|arXiv (Cornell University)|Aug 16, 2017

Advanced Neural Network Applications | Cited by 27

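Summary at a Glance

DeepRebirth accelerates deep neural network inference on mobile devices by optimizing non-tensor layers such as pooling and normalization. It introduces two operations: streamline slimming, which vertically merges consecutive non-tensor and tensor layers, and branch slimming, which horizontally fuses parallel non-tensor branches and small-kernel tensor branches into a single convolution layer. On GoogLeNet the method achieves more than 3x speed-up and 2.5x run-time memory saving with only a 0.4% drop in top-5 accuracy. Combined with other model compression techniques, it reaches 65ms inference time on the Samsung Galaxy S6 CPU at 86.5% top-5 accuracy, ahead of SqueezeNet in both speed and accuracy.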

ABSTRACT

Deploying deep neural networks on mobile devices is a challenging task. Current model compression methods such as matrix decomposition effectively reduce the deployed model size, but still cannot satisfy real-time processing requirement. This paper first discovers that the major obstacle is the excessive execution time of non-tensor layers such as pooling and normalization without tensor-like trainable parameters. This motivates us to design a novel acceleration framework: DeepRebirth through "slimming" existing consecutive and parallel non-tensor and tensor layers. The layer slimming is executed at different substructures: (a) streamline slimming by merging the consecutive non-tensor and tensor layer vertically; (b) branch slimming by merging non-tensor and tensor branches horizontally. The proposed optimization operations significantly accelerate the model execution and also greatly reduce the run-time memory cost since the slimmed model architecture contains less hidden layers. To maximally avoid accuracy loss, the parameters in new generated layers are learned with layer-wise fine-tuning based on both theoretical analysis and empirical verification. As observed in the experiment, DeepRebirth achieves more than 3x speed-up and 2.5x run-time memory saving on GoogLeNet with only 0.4% drop of top-5 accuracy on ImageNet. Furthermore, by combining with other model compression techniques, DeepRebirth offers an average of 65ms inference time on the CPU of Samsung Galaxy S6 with 86.5% top-5 accuracy, 14% faster than SqueezeNet which only has a top-5 accuracy of 80.5%.

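Motivation and Objectives

Address the inference bottleneck in mobile DNNs caused by non-tensor layers such as pooling and normalization, which have no trainable parameters yet take excessive execution time.
Reduce inference latency and run-time memory consumption on mobile CPUs while keeping accuracy loss minimal.
Develop a post-training optimization framework that restructures existing models by fusing non-tensor layers with tensor layers into more efficient single-layer equivalents.
Enable real-time inference on mobile devices by accelerating the non-tensor layers that existing compression techniques often overlook.
Integrate seamlessly with existing model compression methods and improve performance on state-of-the-art architectures such as GoogLeNet and ResNet-50.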

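Proposed Method

Streamline slimming merges consecutive non-tensor layers (e.g., ReLU, pooling, BatchNorm) with the preceding tensor layer (e.g., convolution) into a single optimized convolution layer.
Branch slimming fuses parallel branches, especially non-tensor branches and tensor branches with small kernels (e.g., 1x1), into a single convolution layer with a larger kernel (e.g., 5x5), reducing computational overhead.
The parameters of the newly generated "slimmed" layers are learned through layer-wise fine-tuning so that accuracy is preserved after the structural reorganization.
The framework combines theoretical analysis with empirical verification to keep the accuracy drop from merging to a minimum.
The method is applied after training, requires no training from scratch, and is compatible with existing deep learning models and compression pipelines.
Batch normalization layers are folded directly into the preceding convolution layer through a closed-form transformation, which further speeds up inference without any loss of accuracy.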
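
To make the idea of horizontal merging more concrete, the following is a minimal NumPy sketch showing that a single 5x5 convolution has enough capacity to reproduce a parallel 1x1 branch exactly: zero-padding the 1x1 kernel to 5x5 leaves its output unchanged. This is not the paper's procedure, which learns the merged layer's weights by layer-wise fine-tuning and also absorbs non-tensor branches. The helper names (`conv2d_same`, `embed_kernel`), the tensor shapes, and the absence of nonlinearities between layers are all assumptions made for illustration.

```python
import numpy as np

def conv2d_same(x, w):
    """Naive stride-1 cross-correlation with zero 'same' padding.
    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k. Returns (C_out, H, W)."""
    c_out, _, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd))
    for i in range(h):
        for j in range(wd):
            patch = xp[:, i:i + k, j:j + k]              # (C_in, k, k)
            out[:, i, j] = np.tensordot(w, patch, axes=3)
    return out

def embed_kernel(w_small, k_big):
    """Zero-pad a (C_out, C_in, k, k) kernel to k_big x k_big, keeping it centered."""
    m = (k_big - w_small.shape[-1]) // 2
    return np.pad(w_small, ((0, 0), (0, 0), (m, m), (m, m)))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))        # hypothetical input feature map
w1 = rng.standard_normal((4, 3, 1, 1))    # hypothetical 1x1 branch
w5 = rng.standard_normal((2, 3, 5, 5))    # hypothetical 5x5 branch

# Two parallel branches whose outputs are concatenated along the channel axis.
branches = np.concatenate([conv2d_same(x, w1), conv2d_same(x, w5)], axis=0)

# One 5x5 layer that holds both sets of filters.
w_merged = np.concatenate([embed_kernel(w1, 5), w5], axis=0)
merged = conv2d_same(x, w_merged)

max_abs_diff = float(np.abs(branches - merged).max())
```

In this toy setting the merged layer matches the two branches up to floating-point error. Real Inception-style modules place activations and pooling between layers, so an exact embedding like this is generally not available there, which is presumably why the merged layers are fine-tuned instead.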

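Experimental Results

Research Questions

RQ1: Why do non-tensor layers such as pooling and normalization, which have no parameters, dominate inference latency on mobile CPUs?
RQ2: Can merging non-tensor layers with adjacent tensor layers significantly reduce inference time and memory use without degrading model accuracy?
RQ3: How much do the proposed streamline slimming and branch slimming techniques speed up advanced models such as GoogLeNet and ResNet-50 on mobile hardware?
RQ4: To what extent can DeepRebirth, combined with other compression techniques, deliver real-time inference on mobile devices?
RQ5: Can fine-tuning effectively transfer the knowledge of a complex multi-layer substructure into a simplified single-layer equivalent?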

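Key Findings

On ImageNet, DeepRebirth achieves more than 3x speed-up and 2.5x run-time memory saving on GoogLeNet, with only a 0.4% drop in top-5 accuracy.
On the Samsung Galaxy S6 CPU, the optimized model (combined with other compression techniques) reaches 65ms inference time at 86.5% top-5 accuracy, 14% faster than SqueezeNet and 6 percentage points more accurate (86.5% vs. 80.5%).
For the conv1 and res2a layers of ResNet-50, inference latency drops from 189ms to 104ms and run-time memory cost falls by a factor of 2.21.
Batch normalization layers can be merged directly into the preceding convolution layer, yielding an additional 30–45% speed-up with no loss of accuracy.
The framework maintains high accuracy across multiple models and is compatible with existing compression techniques, leaving room for further optimization.
Layer-wise fine-tuning after merging keeps accuracy loss minimal: at a compression rate of 31.9% on ResNet-50, accuracy drops by only 0.31%.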
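
The batch-normalization finding above rests on a standard algebraic identity: at inference time the running statistics are fixed, so BN(Wx + b) equals (sW)x + (b - mean)s + beta with s = gamma / sqrt(var + eps). The sketch below checks this numerically in NumPy. The function names, shapes, and the choice of a 1x1 convolution for brevity are assumptions for illustration; the code says nothing about the 30–45% speed-up, which is a measured result reported in the paper.

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold inference-time BatchNorm into the preceding convolution.
    w: (C_out, C_in, k, k); b, gamma, beta, mean, var: (C_out,)."""
    scale = gamma / np.sqrt(var + eps)
    w_folded = w * scale[:, None, None, None]   # rescale each output filter
    b_folded = (b - mean) * scale + beta        # shift absorbed into the bias
    return w_folded, b_folded

def conv1x1(x, w, b):
    """1x1 convolution: x is (C_in, H, W), w is (C_out, C_in, 1, 1)."""
    return np.einsum("oc,chw->ohw", w[:, :, 0, 0], x) + b[:, None, None]

def batchnorm_inference(y, gamma, beta, mean, var, eps=1e-5):
    """Per-channel BatchNorm using fixed (running) statistics."""
    s = gamma / np.sqrt(var + eps)
    return (y - mean[:, None, None]) * s[:, None, None] + beta[:, None, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 6, 6))          # hypothetical input
w = rng.standard_normal((4, 3, 1, 1))       # hypothetical conv weights
b = rng.standard_normal(4)
gamma, beta = rng.standard_normal(4), rng.standard_normal(4)
mean, var = rng.standard_normal(4), rng.uniform(0.5, 2.0, 4)

two_layers = batchnorm_inference(conv1x1(x, w, b), gamma, beta, mean, var)
w_f, b_f = fold_batchnorm(w, b, gamma, beta, mean, var)
one_layer = conv1x1(x, w_f, b_f)

fold_error = float(np.abs(two_layers - one_layer).max())
```

Because the folded layer computes the same function up to floating-point rounding, no fine-tuning is needed for this particular merge, in contrast to the pooling and branch merges.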


This review was generated by AI and reviewed by a human editor.