QUICK REVIEW

[Paper Review] Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy

Asit Mishra, Debbie Marr|arXiv (Cornell University)|Nov 15, 2017

Sensor Technology and Measurement Systems | Cited by 157

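One-sentence summary

This paper shows that knowledge distillation significantly improves the accuracy of low-precision DNNs, achieving state-of-the-art results for ternary and 4-bit ResNets on ImageNet through three distillation schemes.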

ABSTRACT

Deep learning networks have achieved state-of-the-art accuracies on computer vision workloads like image classification and object detection. The performant systems, however, typically involve big models with numerous parameters. Once trained, a challenging aspect for such top performing models is deployment on resource constrained inference systems - the models (often deep networks or wide networks or both) are compute and memory intensive. Low-precision numerics and model compression using knowledge distillation are popular techniques to lower both the compute requirements and memory footprint of these deployed models. In this paper, we study the combination of these two techniques and show that the performance of low-precision networks can be significantly improved by using knowledge distillation techniques. Our approach, Apprentice, achieves state-of-the-art accuracies using ternary precision and 4-bit precision for variants of ResNet architecture on ImageNet dataset. We present three schemes using which one can apply knowledge distillation techniques to various stages of the train-and-deploy pipeline.

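Motivation and objectives

Demonstrate how combining quantization with knowledge distillation improves the accuracy of low-precision DNNs.
Quantify the gains from distillation for ResNet-18/34/50/101 on ImageNet.
Present three practical schemes for applying distillation during the training and deployment of low-precision networks.
Compare against prior low-precision methods and establish new state-of-the-art results for sub-8-bit networks.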

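Proposed method

Define a teacher-student (apprentice) framework in which the teacher is full precision and the apprentice is low precision.
Quantize the weights (ternary or 4-bit) and the activations (8-bit/4-bit), while leaving the first and last layers unquantized.
Propose three schemes: A) joint training of the teacher and the apprentice; B) training the apprentice against the logits of a fixed teacher; C) fine-tuning a pretrained full-precision apprentice after lowering its precision.
Use a loss that combines the ground-truth labels, the teacher's logits, and the apprentice's logits through weighted terms (α=1, β=0.5, γ=0.5).
Evaluate ResNet backbones (18, 34, 50, 101) on ImageNet under various precision settings.
Compare against the TTQ and WRPN baselines and report the improvements.
Discuss hyperparameter choices and the saturation effect in teacher-to-student guidance.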
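The weighted loss described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. The review only gives the three weights, so the mapping used here is an assumption: α weights the teacher's cross-entropy against the labels, β weights the apprentice's cross-entropy against the labels, and γ weights the apprentice's cross-entropy against the teacher's softmax output (temperature 1). The function names `softmax`, `cross_entropy`, and `apprentice_loss` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs, eps=1e-12):
    """H(target, probs) = -sum_i target_i * log(probs_i)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))


def apprentice_loss(label, teacher_logits, student_logits,
                    alpha=1.0, beta=0.5, gamma=0.5):
    """Weighted sum of three cross-entropy terms for one example.

    Assumption: alpha -> teacher vs. labels, beta -> apprentice vs. labels,
    gamma -> apprentice vs. teacher soft targets. The review does not state
    which weight belongs to which term.
    """
    num_classes = len(teacher_logits)
    one_hot = [1.0 if i == label else 0.0 for i in range(num_classes)]
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)

    teacher_term = cross_entropy(one_hot, p_teacher)    # teacher vs. ground truth
    student_term = cross_entropy(one_hot, p_student)    # apprentice vs. ground truth
    distill_term = cross_entropy(p_teacher, p_student)  # apprentice vs. teacher
    return alpha * teacher_term + beta * student_term + gamma * distill_term


if __name__ == "__main__":
    # Toy 3-class example with made-up logits.
    print(apprentice_loss(0, [3.0, 1.0, 0.2], [2.0, 1.5, 0.1]))
```

Under this reading, when the teacher is fixed (as in scheme B) the teacher term does not depend on the apprentice's parameters, so only the β and γ terms would drive the apprentice's updates.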

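Experimental results

Research questions

RQ1 Can knowledge distillation substantially recover the accuracy lost by low-precision networks on ImageNet, or even exceed full-precision accuracy?
RQ2 How do the three distillation schemes each contribute to improving ternary and 4-bit ResNet models?
RQ3 How do the teacher's capacity and the target precision affect the final apprentice's performance?
RQ4 Does the approach carry over to different depths (ResNet-18/34/50) in both the training and the fine-tuning settings?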

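Key findings

The three distillation schemes achieve state-of-the-art accuracy for ResNet variants with ternary and 4-bit weights.
Scheme A (joint training), in which a full-precision teacher guides the low-precision student, often delivers the strongest gains.
Scheme B accelerates convergence (fewer epochs are needed) and reaches comparable accuracy.
Scheme C (fine-tuning a pretrained full-precision model after lowering its precision) gives slightly better results in some settings, for example ResNet-50 with ternary weights.
Across all evaluated settings, Apprentice substantially narrows the gap to full-precision accuracy and outperforms prior low-precision baselines (e.g., TTQ, Mellempudi et al.).
In several cases, ternary networks recover to within roughly 1% of full-precision accuracy while keeping model size competitive.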


This review was generated by AI and checked by a human editor.