QUICK REVIEW

[Paper Review] Knowledge Distillation via Route Constrained Optimization

Jin Xiao, Baoyun Peng|arXiv (Cornell University)|Apr 19, 2019

Advanced Neural Network Applications | References: 34 | Citations: 36

One-Line Summary

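RCO trains the student to follow an easy-to-hard sequence of teacher checkpoints taken along the teacher's training route, and outperforms standard knowledge distillation on CIFAR-100, ImageNet, and MegaFace. Guidance from these anchor points lowers the congruence loss between student and teacher.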

ABSTRACT

Distillation-based learning boosts the performance of the miniaturized neural network based on the hypothesis that the representation of a teacher model can be used as structured and relatively weak supervision, and thus would be easily learned by a miniaturized model. However, we find that the representation of a converged heavy model is still a strong constraint for training a small student model, which leads to a high lower bound of congruence loss. In this work, inspired by curriculum learning we consider the knowledge distillation from the perspective of curriculum learning by routing. Instead of supervising the student model with a converged teacher model, we supervised it with some anchor points selected from the route in parameter space that the teacher model passed by, as we called route constrained optimization (RCO). We experimentally demonstrate this simple operation greatly reduces the lower bound of congruence loss for knowledge distillation, hint and mimicking learning. On close-set classification tasks like CIFAR100 and ImageNet, RCO improves knowledge distillation by 2.14% and 1.5% respectively. For the sake of evaluating the generalization, we also test RCO on the open-set face recognition task MegaFace.

Motivation and Objectives

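Revisit knowledge distillation from the perspective of the teacher's optimization route.
Propose Route Constrained Optimization (RCO), which guides the student through anchor points along the teacher's trajectory.
Show that RCO outperforms standard KD and state-of-the-art methods across classification and face recognition tasks.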

Proposed Method

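Define the teacher (φ_t) and the student (φ_s), and distill knowledge using softmax outputs with KL/MSE losses.
Show that, because of the capacity gap, supervision from a converged teacher is hard for the student to fit.
Introduce anchor points C_1,...,C_n along the teacher's training route, and train the student sequentially to mimic φ_t(·;W_{C_i}) at each step.
Provide two anchor-selection strategies: Equal Epoch Interval (EEI), and Greedy Search (GS), which selects informative anchors based on KL divergence.
Formulate the loss at step i as Loss_i = H(y, φ_s(x;W_s)) + λ H(φ_s(x;W_s), φ_t(x;W_{C_i})), where λ weights the term that matches the student to anchor C_i.
Discuss the curriculum-learning rationale and the easy-to-hard sequence that the teacher's trajectory produces.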
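
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `rco_step_loss` computes Loss_i for a single sample, `eei_anchors` picks anchors at equal epoch intervals, and `train_with_rco` walks the student through the anchors in order. The function names, the `train_one_stage` callback, and the choice to always end on the final epoch are hypothetical. The review writes the second term as H(φ_s, φ_t); the sketch assumes the common KD convention in which the anchor output is the target distribution.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred, eps=1e-12):
    """H(target, pred) = -sum_k target_k * log(pred_k)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

def rco_step_loss(y_onehot, student_logits, anchor_logits, lam=1.0):
    """Loss_i for one sample: hard-label term + lam * anchor-matching term.

    Assumption: the anchor (teacher checkpoint C_i) output is used as the
    target distribution of the second term, as in common KD code.
    """
    p_s = softmax(student_logits)
    p_t = softmax(anchor_logits)
    return cross_entropy(y_onehot, p_s) + lam * cross_entropy(p_t, p_s)

def eei_anchors(total_epochs, interval):
    """Equal Epoch Interval: every `interval`-th teacher epoch.

    Assumption: the final epoch (converged teacher) is always the last anchor.
    """
    anchors = list(range(interval, total_epochs + 1, interval))
    if not anchors or anchors[-1] != total_epochs:
        anchors.append(total_epochs)
    return anchors

def train_with_rco(student, teacher_checkpoints, anchors, train_one_stage):
    """Train the student against each anchor in order (easy to hard).

    `teacher_checkpoints` maps epoch -> saved teacher state, and
    `train_one_stage(student, checkpoint)` returns the updated student.
    Both are hypothetical placeholders for a real training setup.
    """
    for epoch in anchors:
        student = train_one_stage(student, teacher_checkpoints[epoch])
    return student
```

For example, `eei_anchors(240, 60)` yields the epochs 60, 120, 180, and 240. A Greedy Search variant would replace `eei_anchors` with a rule that compares checkpoint outputs by KL divergence; that rule is not sketched here because the review does not spell it out.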

Experimental Results

Research Questions

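RQ1: Does supervision from intermediate states along the teacher's training route reduce the congruence loss of a small student, compared with using only the final converged teacher?
RQ2: Do the anchor-selection strategies (EEI, GS) produce a curriculum that improves student performance and training efficiency?
RQ3: Are the gains from RCO consistent across CIFAR-100, ImageNet-1K, and MegaFace face recognition?
RQ4: Can RCO be combined with existing knowledge-transfer methods to improve performance further?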

Key Findings

Experiment	Model	Top-1 (%)	Top-5 (%)
CIFAR-100 (KD)	S-KD (MobileNetV2)	68.71	-
CIFAR-100 (RCO)	S-RCO (MobileNetV2)	70.85	-
ImageNet (KD)	Student-KD (MobileNetV2)	66.75	87.30
ImageNet (RCO)	Student-RCO (MobileNetV2)	68.21	88.04

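RCO improves top-1 accuracy over KD by 2.14% on CIFAR-100 and by 1.5% on ImageNet-1K.
On MegaFace face recognition, RCO reaches 84.3% accuracy with 0.8M parameters, outperforming existing methods.
RCO consistently outperforms KD and other SOTA methods across CIFAR-100, ImageNet, and MegaFace.
The Greedy Search anchor strategy performs best among the strategies evaluated; EEI and GS offer a trade-off between training time and accuracy.
RCO can be combined with earlier knowledge-transfer methods to improve performance further.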
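
The headline gains can be checked directly against the table. The short sketch below only subtracts the top-1 values listed in this review; the dictionary layout and the `top1_gain` helper are illustrative names, not anything from the paper.

```python
# Top-1 accuracies (%) copied from the table in this review.
top1 = {
    "CIFAR-100": {"KD": 68.71, "RCO": 70.85},
    "ImageNet": {"KD": 66.75, "RCO": 68.21},
}

def top1_gain(scores):
    """Absolute top-1 improvement of RCO over KD, in percentage points."""
    return round(scores["RCO"] - scores["KD"], 2)

gains = {name: top1_gain(scores) for name, scores in top1.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```

The CIFAR-100 difference is 2.14 points, and the ImageNet difference is 1.46 points, which corresponds to the rounded 1.5% figure.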

This review was generated by AI and checked by a human editor.