QUICK REVIEW

[Paper Review] Rethinking Soft Labels for Knowledge Distillation: A Bias-Variance Tradeoff Perspective

Helong Zhou, Liangchen Song|arXiv (Cornell University)|Feb 1, 2021

Machine Learning and Algorithms | References: 40 | Citations: 44

One-Line Summary

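The paper analyzes how soft labels in knowledge distillation induce a sample-wise bias-variance tradeoff and introduces weighted soft labels that balance it adaptively. Experiments on standard benchmarks validate the approach.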

ABSTRACT

Knowledge distillation is an effective approach to leverage a well-trained network or an ensemble of them, named as the teacher, to guide the training of a student network. The outputs from the teacher network are used as soft labels for supervising the training of a new network. Recent studies (Müller et al., 2019; Yuan et al., 2020) revealed an intriguing property of the soft labels that making labels soft serves as a good regularization to the student network. From the perspective of statistical learning, regularization aims to reduce the variance, however how bias and variance change is not clear for training with soft labels. In this paper, we investigate the bias-variance tradeoff brought by distillation with soft labels. Specifically, we observe that during training the bias-variance tradeoff varies sample-wisely. Further, under the same distillation temperature setting, we observe that the distillation performance is negatively associated with the number of some specific samples, which are named as regularization samples since these samples lead to bias increasing and variance decreasing. Nevertheless, we empirically find that completely filtering out regularization samples also deteriorates distillation performance. Our discoveries inspired us to propose the novel weighted soft labels to help the network adaptively handle the sample-wise bias-variance tradeoff. Experiments on standard evaluation benchmarks validate the effectiveness of our method. Our code is available at https://github.com/bellymonster/Weighted-Soft-Label-Distillation.

Research Motivation and Objectives

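Motivates an analysis of soft labels in knowledge distillation (KD) from a bias-variance perspective.
Characterizes how bias and variance evolve per sample during KD training.
Identifies regularization samples, which have a disproportionate effect on KD performance.
Proposes and validates weighted soft labels that adaptively manage the sample-wise bias-variance tradeoff during training.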

Proposed Method

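Decomposes the KD loss into bias and variance components using an analysis based on KL divergence.
Compares the bias-variance decomposition of direct training (cross-entropy) with that of the distillation (KD) loss.
Shows that regularization samples exist, for which variance reduction dominates and bias increases.
Introduces a temperature-independent weighting scheme based on the teacher's and student's predictions (weighted soft labels).
Trains with L_ce combined with the weighted KD loss (L_wsl), using a balancing hyperparameter α.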
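
The last two steps can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the review does not state the weighting formula, so the weight 1 - exp(-L_ce^s / L_ce^t) (the student's and teacher's cross-entropy against the ground-truth label, both at temperature 1) and the combination L_ce + α · L_wsl are assumptions here, as are all function names, the `eps` constant, and the default values of `alpha` and `temperature`.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of a list of logits at a given temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def cross_entropy(logits, label):
    """Cross-entropy against a hard label, computed at temperature 1."""
    return -log_softmax(logits)[label]

def kd_loss(student_logits, teacher_logits, temperature):
    """Soft-label loss: cross-entropy between tempered teacher and student
    distributions, scaled by T^2 (a common convention, assumed here)."""
    p_teacher = [math.exp(v) for v in log_softmax(teacher_logits, temperature)]
    log_p_student = log_softmax(student_logits, temperature)
    return -(temperature ** 2) * sum(
        t * ls for t, ls in zip(p_teacher, log_p_student)
    )

def wsl_weight(student_logits, teacher_logits, label, eps=1e-7):
    """ASSUMED weight: 1 - exp(-L_ce^s / L_ce^t).
    It uses only temperature-1 predictions, so it does not depend on T.
    Near 0 when the student already fits the label better than the teacher,
    near 1 when the student is much worse than the teacher."""
    ce_student = cross_entropy(student_logits, label)
    ce_teacher = cross_entropy(teacher_logits, label)
    return 1.0 - math.exp(-ce_student / (ce_teacher + eps))

def total_loss(student_logits, teacher_logits, label, alpha=1.0, temperature=4.0):
    """ASSUMED combination for one sample: L_ce + alpha * L_wsl."""
    l_ce = cross_entropy(student_logits, label)
    l_wsl = wsl_weight(student_logits, teacher_logits, label) * kd_loss(
        student_logits, teacher_logits, temperature
    )
    return l_ce + alpha * l_wsl
```

Under this assumed weight, a sample on which the student is already more accurate than the teacher receives almost no soft-label supervision, while a sample on which the student lags far behind receives nearly the full KD term.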

Experimental Results

Research Questions

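RQ1: When soft labels are used in knowledge distillation, how do bias and variance evolve during training?
RQ2: Under a fixed distillation temperature, what role do regularization samples play in KD performance?
RQ3: Can a sample-wise weighting scheme mitigate the adverse effect of regularization samples and improve KD performance?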

Key Findings

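Soft labels act as both a supervision signal and a regularizer, which gives rise to a sample-wise bias-variance tradeoff.
Under the same temperature, KD performance is negatively associated with the number of regularization samples, a subset of samples that increase bias while decreasing variance.
Completely filtering out regularization samples degrades performance, suggesting that they still carry information useful for KD.
A simple weighted soft label scheme (L_wsl) mitigates the adverse effect of regularization samples and improves KD performance.
Experiments on CIFAR-100 and ImageNet show results that are competitive with or better than state-of-the-art KD methods across a range of teacher-student pairs.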
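
The bias-variance language in these findings can be made concrete with a small numerical check. The sketch below uses the standard decomposition of expected KL divergence, in which the "mean" prediction is the normalized geometric mean of the predictions obtained from different training sets; whether the paper uses exactly this form is an assumption, and the target and prediction values are made up purely for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_prediction(predictions):
    """Normalized geometric mean of several predicted distributions."""
    n = len(predictions)
    k = len(predictions[0])
    log_means = [sum(math.log(p[j]) for p in predictions) / n for j in range(k)]
    unnormalized = [math.exp(v) for v in log_means]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

def bias_variance(target, predictions):
    """Return (bias, variance), where
    bias     = KL(target || mean prediction)
    variance = average of KL(mean prediction || each prediction)."""
    y_bar = mean_prediction(predictions)
    bias = kl(target, y_bar)
    variance = sum(kl(y_bar, p) for p in predictions) / len(predictions)
    return bias, variance

# Hypothetical values: one target and the outputs of three models,
# each imagined as trained on a different draw of the training set.
target = [0.9, 0.05, 0.05]
predictions = [
    [0.7, 0.2, 0.1],
    [0.8, 0.1, 0.1],
    [0.6, 0.1, 0.3],
]

bias, variance = bias_variance(target, predictions)
expected_loss = sum(kl(target, p) for p in predictions) / len(predictions)

# The average KL loss splits exactly into the two terms.
assert abs(expected_loss - (bias + variance)) < 1e-12
```

In this picture, a regularization sample is one whose soft label lowers the variance term at the cost of raising the bias term.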

This review was generated by AI and checked by a human editor.