QUICK REVIEW

[Paper Review] Knowledge Distillation in Generations: More Tolerant Teachers Educate Better Students

Chenglin Yang, Lingxi Xie|arXiv (Cornell University)|May 15, 2018

Online Learning and Analytics | References: 41 | Citations: 67

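One-line summary

This paper proposes training neural networks in generations with a tolerant teacher, which uses a top-score-difference loss to spread confidence softly over secondary classes, so that students learn inter-class similarity and outperform the baselines on CIFAR100 and ILSVRC2012.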

ABSTRACT

We focus on the problem of training a deep neural network in generations. The flowchart is that, in order to optimize the target network (student), another network (teacher) with the same architecture is first trained, and used to provide part of supervision signals in the next stage. While this strategy leads to a higher accuracy, many aspects (e.g., why teacher-student optimization helps) still need further explorations. This paper studies this problem from a perspective of controlling the strictness in training the teacher network. Existing approaches mostly used a hard distribution (e.g., one-hot vectors) in training, leading to a strict teacher which itself has a high accuracy, but we argue that the teacher needs to be more tolerant, although this often implies a lower accuracy. The implementation is very easy, with merely an extra loss term added to the teacher network, facilitating a few secondary classes to emerge and complement to the primary class. Consequently, the teacher provides a milder supervision signal (a less peaked distribution), and makes it possible for the student to learn from inter-class similarity and potentially lower the risk of over-fitting. Experiments are performed on standard image classification tasks (CIFAR100 and ILSVRC2012). Although the teacher network behaves less powerful, the students show a persistent ability growth and eventually achieve higher classification accuracies than other competitors. Model ensemble and transfer feature extraction also verify the effectiveness of our approach.

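Motivation and objectives

Explain why teacher-student optimization helps beyond what the teacher's accuracy alone would account for.
Introduce a tolerant-teacher mechanism that preserves inter-class similarity in the supervision signal.
Propose and evaluate a Top-Score-Difference (TSD) loss that produces useful secondary information.
Quantify how secondary information affects learning dynamics using the Dist^C and Dist^S metrics.
Demonstrate improved student performance from training in generations on standard datasets (CIFAR100 and ILSVRC2012).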

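Proposed method

Frame optimization as training in generations: a grandparent-like initial teacher (the patriarch) followed by a chain of students.
Use a mixed supervision loss that combines the ground-truth labels with the teacher's guidance (Eq. 2).
Introduce a tolerant-teacher objective that softens the output distribution and uses a top-K secondary-class scheme to preserve secondary information (Eq. 3).
Parameterize the method by K, u(η), and λ; set K = 5 and use u(η) in place of η for stability (Eq. 3.4).
On CIFAR100 and ILSVRC2012, compare the TSD variants against baseline one-hot training, label smoothing, and the confidence penalty.
Assess the quality of the secondary information with the Dist^C and Dist^S metrics and relate it to final accuracy.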
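The two losses described above can be sketched in a few lines of plain Python. This review does not reproduce Eq. 2 or Eq. 3, so the exact forms below are assumptions: the teacher loss is taken to be cross-entropy plus η times the gap between the top score and the mean of the next K−1 scores (computed on softmax outputs), and the student loss is taken to be a λ-weighted mix of cross-entropy and a KL term toward the teacher. The function names, and the choice to pass η directly rather than u(η), are illustrative only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tolerant_teacher_loss(logits, label, eta=0.6, k=5):
    """Cross-entropy plus a top-score-difference penalty (assumed form).

    Requires at least k classes. The penalty shrinks when the k-1
    secondary classes receive more of the probability mass.
    """
    probs = softmax(logits)
    ce = -math.log(probs[label])
    top = sorted(probs, reverse=True)[:k]
    # Gap between the primary score and the mean of the k-1 secondary scores.
    gap = top[0] - sum(top[1:]) / (k - 1)
    return ce + eta * gap

def student_loss(student_logits, teacher_logits, label, lam=0.6):
    """Mixed supervision (assumed form): lam weights the hard label,
    1 - lam weights the KL divergence from teacher to student."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    ce = -math.log(p_s[label])
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0)
    return lam * ce + (1 - lam) * kl
```

Under these assumptions, training in generations amounts to fitting the patriarch with `tolerant_teacher_loss`, then fitting each student with `student_loss` against the previous generation's outputs. Setting `eta=0` recovers ordinary one-hot training, which is the strict-teacher baseline the method is compared against.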

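Experimental results

Research questions

RQ1 Does training with a tolerant teacher that preserves secondary information improve student accuracy when learning in generations?
RQ2 How should the teacher's softened distribution be designed (which secondary classes should be emphasized) to maximize the benefit to the student?
RQ3 Which quantitative metrics (Dist^C, Dist^S) correlate with better student performance when training in generations?
RQ4 Beyond CIFAR-style settings, does the generational tolerant-teacher method transfer to large-scale datasets such as ILSVRC2012?
RQ5 Which hyperparameters (K, u(η) via η, and λ) give the best gains across architectures?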

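Key findings

A tolerant teacher that preserves secondary information brings students sustained gains over several generations, often surpassing the patriarch baseline.
On CIFAR100, the best gains come from TSD-0.6, which reaches higher final test accuracy than the baseline and the other compared losses; deeper CNN architectures benefit in the same way.
On CIFAR100, the highest reported accuracy for a tolerant-teacher variant is 73.72%, against reported baselines of roughly 71.5%–72.5%; ensembling improves the results further.
On ILSVRC2012 (ResNet-18), the tolerant-teacher variant D(0.6,0.6) reduces top-1 error from about 30.50% to 29.60% and top-5 error from 11.07% to 10.11%, both at the best generation. Ensembling yields further gains.
DenseNets (100/190 layers) trained with D(0.6,0.6) or D(0.7,0.6) show gains of 1–2% for single models and more than 5% for ensembles, approaching the state of the art at no additional inference-time cost.
The study introduces Dist^C and Dist^S as measures of class discriminability at the coarse and semantic levels, linking a higher Dist^S to coarse-level learning and a lower Dist^C to meaningful secondary information.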
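To put the ILSVRC2012 figures above in perspective, the short calculation below converts the reported error rates into absolute and relative reductions. The helper name `error_reduction` is illustrative; the inputs are the numbers quoted in this review, with the top-1 baseline being approximate.

```python
def error_reduction(before, after):
    """Return (absolute, relative) reduction of an error rate given in percent."""
    absolute = before - after
    relative = absolute / before
    return absolute, relative

top1 = error_reduction(30.50, 29.60)
top5 = error_reduction(11.07, 10.11)
print(f"top-1: {top1[0]:.2f} points absolute, {top1[1]:.1%} relative")
# → top-1: 0.90 points absolute, 3.0% relative
print(f"top-5: {top5[0]:.2f} points absolute, {top5[1]:.1%} relative")
# → top-5: 0.96 points absolute, 8.7% relative
```

The absolute gains are similar for both metrics, but the top-5 reduction is considerably larger in relative terms.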


This review was written by AI and checked by a human editor.