QUICK REVIEW

[Paper Review] When Does Label Smoothing Help?

Rafael Rios Müller, Simon Kornblith|arXiv (Cornell University)|Jun 6, 2019

Time Series Analysis and Forecasting|References: 17|Cited by: 884

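One-Sentence Summary

This paper analyzes how label smoothing affects generalization, calibration, and knowledge distillation, showing that it improves calibration and generalization but can impair distillation because it erases information in the logits.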

ABSTRACT

The generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels. Smoothing the labels in this way prevents the network from becoming over-confident and label smoothing has been used in many state-of-the-art models, including image classification, language translation and speech recognition. Despite its widespread use, label smoothing is still poorly understood. Here we show empirically that in addition to improving generalization, label smoothing improves model calibration which can significantly improve beam-search. However, we also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective. To explain these observations, we visualize how label smoothing changes the representations learned by the penultimate layer of the network. We show that label smoothing encourages the representations of training examples from the same class to group in tight clusters. This results in loss of information in the logits about resemblances between instances of different classes, which is necessary for distillation, but does not hurt generalization or calibration of the model's predictions.
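The soft targets described in the abstract, a weighted average of the hard targets and the uniform distribution over labels, can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function names `smooth_labels` and `cross_entropy` and the smoothing weight `alpha=0.1` are assumptions chosen for the example.

```python
import numpy as np

def smooth_labels(labels, num_classes, alpha=0.1):
    """Mix one-hot targets with the uniform distribution over labels.

    alpha is the smoothing weight (0.1 is an assumed, illustrative value).
    """
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - alpha) * one_hot + alpha / num_classes

def cross_entropy(logits, targets):
    """Mean cross-entropy between softmax(logits) and target distributions."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())

# Two examples with labels 0 and 2 in a 3-class problem.
labels = np.array([0, 2])
soft = smooth_labels(labels, num_classes=3, alpha=0.1)
hard = smooth_labels(labels, num_classes=3, alpha=0.0)  # plain one-hot targets

# Very confident (and correct) logits, made up for illustration.
confident_logits = np.array([[20.0, 0.0, 0.0], [0.0, 0.0, 20.0]])
loss_hard = cross_entropy(confident_logits, hard)
loss_soft = cross_entropy(confident_logits, soft)
```

With hard targets the loss keeps falling as the correct logit grows, whereas with smoothed targets extremely confident logits are penalized, which is one way to see why smoothing discourages over-confidence.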

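Research Motivation and Objectives

Investigate why and when label smoothing improves neural network performance.
Characterize how label smoothing changes penultimate-layer representations.
Evaluate the effect of label smoothing on model calibration across a range of tasks.
Examine how label smoothing affects knowledge distillation and information transfer.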

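Proposed Method

Introduce a visualization method for penultimate-layer activations based on dimensionality reduction.
Quantify calibration using expected calibration error (ECE) and reliability diagrams.
Evaluate calibration and accuracy on image classification and translation tasks, with and without label smoothing.
Analyze the effect of label smoothing on knowledge distillation in a teacher–student setting.
Estimate the mutual information between the input and the logits to examine how much information is preserved under label smoothing.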
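The expected calibration error (ECE) used to quantify calibration can be sketched as follows. This is a minimal version of the commonly used binned estimator, not the authors' implementation: the function name `expected_calibration_error` and the choice of 15 equal-width bins are assumptions made for the example.

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=15):
    """Binned ECE: weighted average of |accuracy - confidence| per bin.

    confidences: max softmax probability per prediction, in (0, 1].
    correct:     1 if the prediction was right, else 0.
    num_bins:    number of equal-width bins (15 is an assumed default).
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of samples in bin
    return float(ece)

# Made-up predictions: 95% confident but only half are right.
overconfident = expected_calibration_error([0.95] * 4, [1, 1, 0, 0])
# Made-up predictions: fully confident and always right.
calibrated = expected_calibration_error([1.0, 1.0], [1, 1])
```

A reliability diagram plots the same per-bin accuracy against per-bin confidence, so a perfectly calibrated model lies on the diagonal and has an ECE of zero.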

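Experimental Results

Research Questions

RQ1: Does label smoothing improve model calibration, and does that benefit downstream procedures such as beam search?
RQ2: How does label smoothing restructure penultimate-layer representations?
RQ3: Why does label smoothing hinder knowledge distillation even though it improves the teacher's accuracy?
RQ4: How are label smoothing, mutual information, and information compression related?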

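Key Findings

Label smoothing can improve calibration and curb over-confident predictions.
Label smoothing produces tighter, equally spaced clusters in the penultimate-layer activations, indicating that information about similarities between classes is erased.
Label smoothing improves BLEU and calibration on translation tasks, while NLL is worse than with hard targets.
Distillation from a teacher trained with label smoothing can underperform distillation from a teacher trained with hard targets, because information in the logits is lost.
Label smoothing reduces the mutual information between the input and the difference between logits, indicating information erasure in the representations.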
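To make the distillation finding concrete, the sketch below uses the standard temperature-softened distillation loss, in which the student matches the teacher's softened output distribution. It is an illustration under stated assumptions: the helper names, the temperature of 2.0, and both sets of teacher logits are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_log_probs = np.log(softmax(student_logits, temperature))
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())

# Hypothetical teacher logits for one example of class 0 (made-up numbers).
hard_teacher = np.array([[6.0, 3.0, -1.0]])     # class 1 looks closer than class 2
smoothed_teacher = np.array([[6.0, 1.0, 1.0]])  # wrong classes look alike

hard_soft = softmax(hard_teacher, temperature=2.0)
smooth_soft = softmax(smoothed_teacher, temperature=2.0)

# The loss is lowest when the student reproduces the teacher's logits.
loss_match = distillation_loss(hard_teacher, hard_teacher)
loss_mismatch = distillation_loss(smoothed_teacher, hard_teacher)
```

In this toy case the first teacher's soft targets still rank the wrong classes, while the second teacher assigns them equal probability, so a student has nothing to learn from it about which classes resemble each other. That is the kind of logit information the review describes as being lost.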

This review was generated by AI and checked by a human editor.