QUICK REVIEW

[Paper Review] Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy

Asit Mishra, Debbie Marr|arXiv (Cornell University)|Nov 15, 2017

Advanced Neural Network Applications | References: 25 | Cited by: 154

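One-Sentence Summary

Apprentice combines quantization with knowledge distillation to improve the accuracy of low-precision DNNs, achieving state-of-the-art results at ternary and 4-bit precision for ResNet variants on ImageNet.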

ABSTRACT

Deep learning networks have achieved state-of-the-art accuracies on computer vision workloads like image classification and object detection. The performant systems, however, typically involve big models with numerous parameters. Once trained, a challenging aspect for such top performing models is deployment on resource constrained inference systems - the models (often deep networks or wide networks or both) are compute and memory intensive. Low-precision numerics and model compression using knowledge distillation are popular techniques to lower both the compute requirements and memory footprint of these deployed models. In this paper, we study the combination of these two techniques and show that the performance of low-precision networks can be significantly improved by using knowledge distillation techniques. Our approach, Apprentice, achieves state-of-the-art accuracies using ternary precision and 4-bit precision for variants of ResNet architecture on ImageNet dataset. We present three schemes using which one can apply knowledge distillation techniques to various stages of the train-and-deploy pipeline.

Motivation and Objectives

Enable deployment of high-accuracy DNNs on resource-constrained inference systems by reducing weight and activation precision.
Investigate whether knowledge distillation can compensate for the accuracy lost in low-precision models.
Propose three practical schemes that apply distillation at different stages of the train-and-deploy pipeline to produce high-accuracy low-precision models.

Proposed Method

Use a teacher–student (apprentice) framework where the teacher is full-precision and the student is low-precision.
Propose three schemes: scheme-A trains the teacher and the student jointly; scheme-B distills from a pre-trained teacher into a student trained from scratch; scheme-C fine-tunes a low-precision student that is initialized from a pre-trained full-precision state.
Train with the loss L(x; W_T, W_A) = α H(y, p^T) + β H(y, p^A) + γ H(z^T, p^A), with α=1, β=0.5, γ=0.5. Here y is the ground-truth label, p^T and p^A are the teacher and apprentice predictions, and z^T denotes the teacher's logits, so the third term pulls the apprentice toward the teacher's output.
Keep the first and last layers of the apprentice at full precision, and quantize weights and activations in the hidden layers (using the WRPN scheme for 4- or 8-bit precision).
Experiment with ResNet-18/34/50 as students and ResNet-34/50/101 as teachers; use ternary weights or 4-bit weights with 8-bit activations for low-precision variants.
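
The three-term loss above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: the function names (`softmax`, `cross_entropy`, `apprentice_loss`) are hypothetical, H is assumed to be cross-entropy, and the third term assumes that the teacher logits z^T are turned into a soft target with a softmax, which the summary does not specify.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy H(target, predicted) between two distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def apprentice_loss(y, teacher_logits, student_logits,
                    alpha=1.0, beta=0.5, gamma=0.5):
    """alpha*H(y, p^T) + beta*H(y, p^A) + gamma*H(z^T, p^A).

    y is a one-hot label. Assumption: the z^T target in the third term
    is softmax(teacher_logits); the source only says the apprentice
    trains on the teacher's logits.
    """
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    teacher_term = cross_entropy(y, p_teacher)        # H(y, p^T)
    student_term = cross_entropy(y, p_student)        # H(y, p^A)
    distill_term = cross_entropy(p_teacher, p_student)  # H(z^T, p^A), assumed form
    return alpha * teacher_term + beta * student_term + gamma * distill_term

# One 3-class example: the student loss drops when it agrees with the label.
label = [0.0, 1.0, 0.0]
good = apprentice_loss(label, [0.0, 5.0, 0.0], [0.0, 5.0, 0.0])
bad = apprentice_loss(label, [0.0, 5.0, 0.0], [5.0, 0.0, 0.0])
```

With the default weights, the teacher's own classification term counts fully (α=1), while the student's label term and the distillation term each count half. In scheme-A all three terms would drive updates; with a fixed pre-trained teacher, the first term is constant with respect to the student's weights.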

Experimental Results

Research Questions

RQ1: Can knowledge distillation improve the accuracy of low-precision (ternary and 4-bit) networks on ImageNet?
RQ2: How do the three distillation schemes compare in terms of final accuracy and training efficiency?
RQ3: What are the effects of teacher-student depth and precision configurations on achieving state-of-the-art low-precision performance?
RQ4: Is joint training (scheme-A) superior to distillation from a pre-trained teacher (scheme-B) or fine-tuning (scheme-C) for low-precision networks?

Key Findings

All three Apprentice schemes yield accuracy improvements over prior low-precision methods for ResNet-18/34/50 on ImageNet-1K.
Scheme-A and Scheme-C achieve state-of-the-art results for ternary and 4-bit weight configurations on several ResNet variants, approaching or matching full-precision baselines in some cases.
For ResNet-34 and ResNet-50, 4-bit weights with 8-bit activations trained via Apprentice reach within about 0.5–1.5 percentage points of full-precision Top-1 error on ImageNet when using appropriate teachers.
Using larger teachers can improve low-precision student accuracy, with diminishing returns beyond a certain teacher size.
Scheme-B reaches accuracy comparable to Scheme-A in fewer training epochs, while Scheme-C can yield marginal gains over Scheme-A in some configurations.
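
To make the ternary and 4-bit weight configurations in these findings concrete, here is a minimal sketch of what each precision does to a weight vector. It is not the WRPN scheme or the paper's exact quantizer: the function names are hypothetical, the ternary threshold (0.7 times the mean absolute weight) is an assumed heuristic, and the 4-bit case uses a plain symmetric uniform grid.

```python
def ternarize(weights, threshold_ratio=0.7):
    """Map weights to {-s, 0, +s}.

    Assumption: the threshold is threshold_ratio * mean(|w|), and the
    scale s is the mean magnitude of the weights above that threshold.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = threshold_ratio * mean_abs
    kept = [abs(w) for w in weights if abs(w) > delta]
    scale = sum(kept) / len(kept) if kept else 0.0
    return [scale * (1 if w > delta else -1 if w < -delta else 0)
            for w in weights]

def quantize_uniform(weights, bits=4):
    """Round weights to a symmetric uniform grid with 2**(bits-1) - 1
    positive levels (7 per sign for 4 bits)."""
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return [0.0 for _ in weights]
    step = max_abs / levels
    return [round(w / step) * step for w in weights]

w = [0.9, -0.8, 0.05, -0.1]
w_ternary = ternarize(w)               # large weights share one magnitude, small ones become 0
w_4bit = quantize_uniform([0.7, -0.33, 0.12, 0.0], bits=4)
```

The ternary version keeps only sign and a single shared magnitude, which is why its accuracy gap is harder to close than the 4-bit gap; distillation from a full-precision teacher is the paper's proposed way to recover part of that loss.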

This review was generated by AI and reviewed by human editors.