QUICK REVIEW

[Paper Review] Self-Distillation as Instance-Specific Label Smoothing

Zhilu Zhang, Mert R. Sabuncu|arXiv (Cornell University)|Jun 9, 2020

Machine Learning and Data Classification|References: 41|Citations: 52

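One-Line Summary

This paper interprets self-distillation as instance-specific regularization within a MAP framework, links distillation to label smoothing, and introduces Beta smoothing, which promotes confidence diversity without a separate teacher.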

ABSTRACT

It has been recently demonstrated that multi-generational self-distillation can improve generalization. Despite this intriguing observation, reasons for the enhancement remain poorly understood. In this paper, we first demonstrate experimentally that the improved performance of multi-generational self-distillation is in part associated with the increasing diversity in teacher predictions. With this in mind, we offer a new interpretation for teacher-student training as amortized MAP estimation, such that teacher predictions enable instance-specific regularization. Our framework allows us to theoretically relate self-distillation to label smoothing, a commonly used technique that regularizes predictive uncertainty, and suggests the importance of predictive diversity in addition to predictive uncertainty. We present experimental results using multiple datasets and neural network architectures that, overall, demonstrate the utility of predictive diversity. Finally, we propose a novel instance-specific label smoothing technique that promotes predictive diversity without the need for a separately trained teacher model. We provide an empirical evaluation of the proposed method, which, we find, often outperforms classical label smoothing.

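Motivation and Objectives

Investigate why multi-generational self-distillation improves generalization.
Provide a MAP-based interpretation of teacher-student training.
Relate distillation to label smoothing and highlight the role of predictive diversity.
Propose Beta smoothing as an efficient instance-specific regularization technique.
Demonstrate improved calibration through regularization on the probability simplex.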

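Proposed Method

Model the distillation process as amortized MAP estimation of the softmax outputs.
Relate teacher predictions to an instance-specific prior over the output distribution.
Compare self-distillation with classical label smoothing through systematic experiments.
Introduce Beta smoothing to implement instance-specific priors without a separate teacher.
Analyze predictive uncertainty and confidence diversity using entropy-based metrics.
Evaluate calibration improvements across datasets using expected calibration error (ECE).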
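
The contrast between classical and instance-specific smoothing can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the Beta(a, 1) parameterization, and the rule of assigning larger true-class weights to more confident examples are assumptions made for illustration only.

```python
import numpy as np

def uniform_smoothing_targets(labels, num_classes, eps=0.1):
    """Classical label smoothing: every example gets the same amount eps."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - eps) * one_hot + eps / num_classes

def instance_specific_targets(labels, num_classes, confidences, a=3.0, rng=None):
    """Hypothetical Beta-based smoothing with one true-class weight per example.

    Weights are drawn from Beta(a, 1) and assigned by confidence rank, so the
    most confident example in the batch keeps the largest true-class weight.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(labels)
    weights = np.sort(rng.beta(a, 1.0, size=n))   # ascending draws
    ranks = np.argsort(np.argsort(confidences))   # 0 = least confident
    b = weights[ranks]                            # per-example true-class weight
    one_hot = np.eye(num_classes)[labels]
    off = (1.0 - b)[:, None] / (num_classes - 1)  # spread the rest uniformly
    return one_hot * b[:, None] + (1.0 - one_hot) * off

def cross_entropy(probs, targets, tiny=1e-12):
    """Mean cross-entropy between soft targets and predicted probabilities."""
    return float(-np.mean(np.sum(targets * np.log(probs + tiny), axis=1)))

if __name__ == "__main__":
    labels = np.array([0, 2, 1])
    confidences = np.array([0.9, 0.5, 0.7])  # made-up model confidences
    print(uniform_smoothing_targets(labels, 3))
    print(instance_specific_targets(labels, 3, confidences,
                                    rng=np.random.default_rng(0)))
```

With uniform smoothing every row carries the same true-class mass, while the second function gives each example its own amount of smoothing, which is the property the review calls instance-specific.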

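Experimental Results

Research Questions

RQ1: Does increasing the diversity of teacher predictions correlate with improved student performance in self-distillation?
RQ2: Is self-distillation theoretically connected to label smoothing through the MAP framework?
RQ3: Does instance-specific regularization (including Beta smoothing) outperform conventional label smoothing?
RQ4: Does Beta smoothing yield calibration benefits equal to or better than self-distillation?
RQ5: What role does predictive diversity play in improving generalization and calibration?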
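
The MAP connection asked about in RQ2 can be sketched as follows. The sketch assumes a Dirichlet prior with parameters $\alpha_{x_i}$ on the softmax output for each input $x_i$. The notation is illustrative and may differ from the paper's.

```latex
\max_{\theta}\;
\sum_{i=1}^{n} \log p_\theta(y_i \mid x_i)
\;+\;
\sum_{i=1}^{n} \sum_{k=1}^{K} \bigl(\alpha_{x_i,k} - 1\bigr)\,
\log p_\theta(k \mid x_i)
```

The first term is the usual log-likelihood. The second is the log of the Dirichlet density up to a constant, and it acts as a cross-entropy-style regularizer on the predicted distribution. If $\alpha_{x_i}$ is the same uniform vector for every input, the regularizer resembles label smoothing. If $\alpha_{x_i}$ is derived from a teacher's prediction for $x_i$, it becomes instance-specific, as in distillation.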

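Key Findings

Sequential self-distillation improves test accuracy across generations and yields better calibration.
Higher diversity in teacher predictions is associated with better student performance.
Label smoothing increases predictive uncertainty but may fail to achieve diversity; instance-specific priors help.
Beta smoothing often outperforms classical label smoothing and can match self-distillation without a separate teacher.
The MAP perspective explains distillation as a form of instance-specific regularization, which improves calibration.
Temperature-scaled teacher predictions substantially improve student accuracy by controlling uncertainty and diversity.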
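
How temperature changes a teacher's uncertainty and diversity can be seen in a small NumPy sketch. The logits are made up. Using the standard deviation of the true-class probability as a diversity measure is an assumed proxy, and the paper's exact metric may differ.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis of a 2-D array."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predictive_uncertainty(probs, tiny=1e-12):
    """Mean entropy of the predicted distributions."""
    return float(np.mean(-np.sum(probs * np.log(probs + tiny), axis=1)))

def confidence_diversity(probs, labels):
    """Std. dev. of the true-class probability across examples (assumed proxy)."""
    return float(np.std(probs[np.arange(len(labels)), labels]))

if __name__ == "__main__":
    # Hypothetical teacher logits for three examples and three classes.
    logits = np.array([[4.0, 1.0, 0.0],
                       [2.0, 1.5, 0.0],
                       [0.5, 3.0, 0.2]])
    labels = np.array([0, 0, 1])
    for t in (0.5, 1.0, 4.0, 100.0):
        p = softmax(logits, temperature=t)
        print(t, predictive_uncertainty(p), confidence_diversity(p, labels))
```

Raising the temperature flattens every prediction, so mean entropy grows. At very high temperatures all predictions approach uniform and the diversity measure collapses toward zero, which is one way to see that uncertainty and diversity are separate quantities.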


This review was generated by AI and checked by a human editor.