QUICK REVIEW

[Paper Review] Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy

Asit Mishra, Debbie Marr|arXiv (Cornell University)|Nov 15, 2017

Sensor Technology and Measurement Systems | Cited by 157

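One-Sentence Summary

This paper shows that knowledge distillation can significantly improve the accuracy of low-precision DNNs, using three distillation schemes to reach state-of-the-art results on ImageNet for ternary and 4-bit ResNets.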

ABSTRACT

Deep learning networks have achieved state-of-the-art accuracies on computer vision workloads like image classification and object detection. The performant systems, however, typically involve big models with numerous parameters. Once trained, a challenging aspect for such top performing models is deployment on resource constrained inference systems - the models (often deep networks or wide networks or both) are compute and memory intensive. Low-precision numerics and model compression using knowledge distillation are popular techniques to lower both the compute requirements and memory footprint of these deployed models. In this paper, we study the combination of these two techniques and show that the performance of low-precision networks can be significantly improved by using knowledge distillation techniques. Our approach, Apprentice, achieves state-of-the-art accuracies using ternary precision and 4-bit precision for variants of ResNet architecture on ImageNet dataset. We present three schemes using which one can apply knowledge distillation techniques to various stages of the train-and-deploy pipeline.

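Research Motivation and Objectives

Demonstrate how combining quantization with knowledge distillation improves the accuracy of low-precision DNNs.
Quantify the gains from distillation on ImageNet for ResNet-18/34/50/101.
Propose three practical schemes for applying distillation while training and deploying low-precision networks.
Compare against existing low-precision methods and establish new state-of-the-art results for networks below 8 bits.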

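Proposed Method

Define a teacher-student (apprentice) framework in which the teacher is full precision and the apprentice is low precision.
Quantize weights (ternary or 4-bit) and activations (8-bit/4-bit), leaving the first and last layers unquantized.
Propose three schemes: A) joint training of teacher and apprentice; B) training the apprentice against fixed teacher logits; C) fine-tuning a pretrained full-precision apprentice after lowering its precision.
Use a loss that combines the ground-truth labels, the teacher logits, and the apprentice logits, with tuned weights (α=1, β=0.5, γ=0.5).
Evaluate ResNet backbones (18, 34, 50, 101) on ImageNet under a range of precision configurations.
Compare against the TTQ and WRPN baselines and report the improvements.
Discuss hyperparameter choices and the observed saturation effect in teacher-student guidance.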
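
The weighted loss listed above can be sketched for a single training example as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the exact form of each term (cross-entropy of the labels against the teacher and apprentice predictions, plus cross-entropy of the teacher's softmax output against the apprentice's) is an assumption made to match the summary's description of a loss built from labels, teacher logits, and apprentice logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy H(target, pred) for two probability vectors."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

def apprentice_loss(teacher_logits, student_logits, label,
                    alpha=1.0, beta=0.5, gamma=0.5):
    """Weighted sum of three terms (assumed forms, see lead-in):
    alpha * H(y, p_teacher)          teacher vs. ground truth
    beta  * H(y, p_student)          apprentice vs. ground truth
    gamma * H(p_teacher, p_student)  apprentice vs. teacher (distillation)
    """
    one_hot = [1.0 if i == label else 0.0 for i in range(len(teacher_logits))]
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    return (alpha * cross_entropy(one_hot, p_teacher)
            + beta * cross_entropy(one_hot, p_student)
            + gamma * cross_entropy(p_teacher, p_student))

# Example: the apprentice agrees with the teacher on class 0.
loss = apprentice_loss([5.0, 0.0, 0.0], [4.0, 0.5, 0.0], label=0)
print(f"loss = {loss:.4f}")
```

Under scheme B, where the teacher is fixed, the first term is constant with respect to the apprentice's weights, so only the second and third terms would drive the apprentice's updates.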

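Experimental Results

Research Questions

RQ1: Can knowledge distillation significantly recover, or even exceed, the accuracy of low-precision networks on ImageNet?
RQ2: How do the three distillation schemes compare in improving ternary and 4-bit ResNet models?
RQ3: How do teacher capacity and target precision affect the final apprentice performance?
RQ4: Do these gains hold across ResNet depths (18, 34, 50) and across both training and fine-tuning settings?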

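Key Findings

The three distillation schemes deliver state-of-the-art accuracy for ternary and 4-bit weights across ResNet variants.
Scheme A (joint training) usually yields the largest gains, with the full-precision teacher guiding the low-precision student.
Scheme B improves convergence speed (fewer training epochs) while reaching comparable accuracy.
Scheme C (fine-tuning a pretrained full-precision model after lowering its precision) achieves slightly better results in some configurations, such as ResNet-50 with ternary weights.
Across the tested configurations, the apprentice substantially narrows the gap to full precision and outperforms earlier low-precision baselines (such as TTQ and Mellempudi et al.).
Ternary networks are competitive in model size and in several cases recover much of the lost accuracy, coming within roughly 1% of full-precision accuracy.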
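
To make the model-size point concrete, the sketch below works through the weight-storage arithmetic for different bit widths. The weight count is a hypothetical round number, not a figure from the paper. It also assumes that full precision means 32-bit floats and that each ternary weight is stored in 2 bits, and it ignores the unquantized first and last layers and any per-layer scaling factors.

```python
def weight_storage_bytes(num_weights, bits_per_weight):
    """Bytes needed to store num_weights values at the given bit width."""
    return num_weights * bits_per_weight / 8

# Hypothetical weight count chosen for illustration only.
HYPOTHETICAL_WEIGHTS = 10_000_000

configs = [
    ("full precision (assumed FP32)", 32),
    ("4-bit weights", 4),
    ("ternary weights (assumed 2-bit encoding)", 2),
]

for name, bits in configs:
    size_mb = weight_storage_bytes(HYPOTHETICAL_WEIGHTS, bits) / 1e6
    ratio = 32 / bits  # reduction relative to 32-bit storage
    print(f"{name}: {size_mb:.1f} MB ({ratio:.0f}x smaller than FP32)")
```

The reduction factor depends only on the bit width, so the same ratios apply to any weight count under these assumptions.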

This review was generated by AI and checked by a human editor.