[Paper Interpretation] DeepTrust^RT: Confidential Deep Neural Inference Meets Real-Time!
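This paper proposes Deep Compression, a three-stage pipeline that combines pruning, trained quantization, and Huffman coding to compress deep neural networks without loss of accuracy. The method reduces the storage requirement of AlexNet from 240MB to 6.9MB (35x) and of VGG-16 from 552MB to 11.3MB (49x), which lets the model fit in on-chip SRAM cache and yields 3x to 7x better energy efficiency on CPU, GPU, and mobile GPU platforms.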
Deep Neural Networks (DNNs) are becoming common in "learning-enabled" time-critical applications such as autonomous driving and robotics. One approach to protect DNN inference from adversarial actions and preserve model privacy/confidentiality is to execute them within trusted enclaves available in modern processors. However, running DNN inference inside limited-capacity enclaves while ensuring timing guarantees is challenging due to (a) large size of DNN workloads and (b) extra switching between "normal" and "trusted" execution modes. This paper introduces new time-aware scheduling schemes - DeepTrust^RT - to securely execute deep neural inferences for learning-enabled real-time systems. We first propose a variant of EDF (called DeepTrust^RT-LW) that slices each DNN layer and runs them sequentially in the enclave. However, due to extra context switch overheads of individual layer slices, we further introduce a novel layer fusion technique (named DeepTrust^RT-FUSION). Our proposed scheme provides hard real-time guarantees by fusing multiple layers of DNN workload from multiple tasks; thus allowing them to fit and run concurrently within the enclaves while maintaining real-time guarantees. We implemented and tested DeepTrust^RT ideas on the Raspberry Pi platform running OP-TEE+DarkNet-TZ DNN APIs and three DNN workloads (AlexNet-squeezed, Tiny Darknet, YOLOv3-tiny). Compared to the layer-wise partitioning approach (DeepTrust^RT-LW), DeepTrust^RT-FUSION can schedule up to 3x more tasksets and reduce context switches by up to 11.12x. We further demonstrate the efficacy of DeepTrust^RT using a flight controller (ArduPilot) case study and find that DeepTrust^RT-FUSION retains real-time guarantees where DeepTrust^RT-LW becomes unschedulable.
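Research Motivation and Goals
- Address the challenge of deploying large, high-accuracy deep neural networks on mobile and embedded systems with limited storage and energy budgets.
- Shrink the storage footprint of deep neural networks so that they fit entirely in on-chip SRAM, avoiding expensive off-chip DRAM accesses.
- Minimize energy consumption by reducing memory bandwidth usage, since memory access dominates energy consumption in mobile systems.
- Make complex models practical to deploy in mobile applications that are constrained by binary size and bandwidth.
- Preserve model accuracy under aggressive compression through a structured, trainable compression pipeline.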
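Proposed Method
- Apply magnitude-based pruning to remove connections with small weights, reducing the parameter count by 9x to 13x while preserving accuracy.
- Apply trained quantization with weight sharing: group the weights into clusters (for example, 32 centroids for fully connected layers), store only the centroids and the per-weight cluster indices, and recover accuracy through fine-tuning.
- Apply Huffman coding to the compressed indices and centroids to reduce storage further, reaching a total compression ratio of 35x to 49x.
- Represent sparse weight matrices in compressed sparse row (CSR) or compressed sparse column (CSC) format, with relative index encoding to reduce metadata overhead.
- Store only the codebook (shared weight values), the indices (cluster assignments), and the compressed metadata, keeping storage overhead to a minimum.
- Fine-tune after pruning and after quantization to optimize the remaining weights and the centroids, so that accuracy does not drop.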
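The pruning, weight-sharing, and relative-indexing steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`prune_by_magnitude`, `kmeans_1d`, `to_relative`, `compress`, `decompress`), the threshold of 1.0, the random Gaussian weights, and the evenly spaced centroid initialization are all choices made for this example. The fine-tuning step is omitted, and the gaps are stored as unbounded integers rather than in a fixed number of bits.

```python
import random

def prune_by_magnitude(weights, threshold):
    """Zero out every weight whose absolute value is below threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def nearest(value, centroids):
    """Index of the centroid closest to value."""
    return min(range(len(centroids)), key=lambda c: abs(value - centroids[c]))

def kmeans_1d(values, k, iters=20):
    """Cluster scalars into k shared values; assumes k >= 2 and non-empty values."""
    lo, hi = min(values), max(values)
    # Assumed initialization: centroids spaced evenly between min and max.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        assign = [nearest(v, centroids) for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:  # empty clusters stay where they are
                centroids[c] = sum(members) / len(members)
    # Final assignment against the final centroids.
    return centroids, [nearest(v, centroids) for v in values]

def to_relative(positions):
    """Store gaps between successive nonzero positions instead of absolute ones."""
    return [p - q for p, q in zip(positions, [0] + positions[:-1])]

def compress(weights, threshold, k):
    """Prune, then replace each surviving weight by a codebook index."""
    pruned = prune_by_magnitude(weights, threshold)
    positions = [i for i, w in enumerate(pruned) if w != 0.0]
    codebook, indices = kmeans_1d([pruned[i] for i in positions], k)
    return to_relative(positions), indices, codebook

def decompress(n, gaps, indices, codebook):
    """Rebuild a dense weight list of length n from gaps, indices, and codebook."""
    dense, pos = [0.0] * n, 0
    for gap, idx in zip(gaps, indices):
        pos += gap
        dense[pos] = codebook[idx]
    return dense

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in for a layer
gaps, indices, codebook = compress(weights, threshold=1.0, k=32)
restored = decompress(len(weights), gaps, indices, codebook)
```

With 32 centroids, each surviving weight is represented by a 5-bit index into the codebook plus its position gap, instead of a full-precision value.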
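Experimental Results
Research Questions
- RQ1: Can the combination of pruning, quantization, and coding compress deep neural networks by 35x to 49x with no loss of accuracy?
- RQ2: Does applying pruning and quantization together in a joint pipeline achieve a higher compression ratio than applying them sequentially?
- RQ3: Can the compressed model fit entirely in on-chip SRAM, reducing reliance on energy-hungry DRAM?
- RQ4: How does compression affect inference speed and energy efficiency on CPU, GPU, and mobile GPU platforms?
- RQ5: Does the method generalize across network architectures such as AlexNet, VGG-16, and LeNet without a drop in accuracy?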
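Key Findings
- Deep Compression reduces the AlexNet model size from 240MB to 6.9MB (35x compression) with no loss of accuracy on ImageNet.
- VGG-16 is compressed to 11.3MB (49x compression), also with no drop in accuracy.
- LeNet is compressed by 39x with no loss of accuracy, showing that the method generalizes across architectures.
- The compressed models achieve a 3x to 4x layer-wise speedup and 3x to 7x better energy efficiency on CPU, GPU, and mobile GPU platforms.
- The final model is stored entirely in on-chip SRAM (5pJ per access), avoiding off-chip DRAM accesses (640pJ per access), which is about 128x less energy per access.
- The method outperforms prior work in both compression ratio and accuracy retention: pruning and quantization alone reach 27x to 31x compression, and adding Huffman coding brings it to 35x to 49x.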
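To see why Huffman coding adds compression on top of quantization, the sketch below compares the bit cost of a skewed stream of cluster indices under a fixed-width code and under a Huffman code. It is an illustrative example only: the toy index stream and the helper names `huffman_code_lengths` and `encoded_bits` are assumptions made here, and the sketch computes code lengths only rather than producing an actual bitstream.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over symbols."""
    counts = Counter(symbols)
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries: (frequency, unique tie-breaker, {symbol: depth so far}).
    heap = [(freq, i, {sym: 0}) for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

def encoded_bits(symbols):
    """Total number of bits needed to Huffman-encode the symbol stream."""
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)

# Toy stream of cluster indices with a skewed distribution (hypothetical data).
indices = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
fixed_bits = 2 * len(indices)         # 4 distinct symbols need 2 bits each
huffman_bits = encoded_bits(indices)  # frequent symbols get shorter codes
print(fixed_bits, huffman_bits)       # prints "32 28"
```

The saving depends on how uneven the index distribution is; a uniform distribution over a power-of-two number of symbols would give no gain over the fixed-width code.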
This interpretation was generated by AI and reviewed by a human editor.