QUICK REVIEW

[Paper Review] Sample Selection with Uncertainty of Losses for Learning with Noisy Labels

Xiaobo Xia, Tongliang Liu|arXiv (Cornell University)|Jun 1, 2021

Machine Learning and Data Classification | References: 69 | Citations: 48

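One-line summary

This paper proposes CNLCU, a sample selection method that uses uncertainty derived from interval estimates of losses. It learns robustly from noisy labels and improves robustness on balanced and imbalanced data and under real-world noise.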

ABSTRACT

In learning with noisy labels, the sample selection approach is very popular, which regards small-loss data as correctly labeled during training. However, losses are generated on-the-fly based on the model being trained with noisy labels, and thus large-loss data are likely but not certainly to be incorrect. There are actually two possibilities of a large-loss data point: (a) it is mislabeled, and then its loss decreases slower than other data, since deep neural networks "learn patterns first"; (b) it belongs to an underrepresented group of data and has not been selected yet. In this paper, we incorporate the uncertainty of losses by adopting interval estimation instead of point estimation of losses, where lower bounds of the confidence intervals of losses derived from distribution-free concentration inequalities, but not losses themselves, are used for sample selection. In this way, we also give large-loss but less selected data a try; then, we can better distinguish between the cases (a) and (b) by seeing if the losses effectively decrease with the uncertainty after the try. As a result, we can better explore underrepresented data that are correctly labeled but seem to be mislabeled at first glance. Experiments demonstrate that the proposed method is superior to baselines and robust to a broad range of label noise types.
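The selection rule described in the abstract can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's criterion: the width term is a generic Hoeffding-style bonus standing in for the paper's concentration bounds, and the names `lower_bound`, `select_small_lower_bound`, and `scale` are hypothetical.

```python
import math

def lower_bound(mean_loss, n_selected, t, scale=1.0):
    """Illustrative lower confidence bound on a sample's loss.

    The width shrinks as the sample is selected more often, so rarely
    selected samples get a more optimistic (smaller) score.
    Assumption: a Hoeffding-style width, not the bound from the paper.
    """
    n = max(n_selected, 1)
    return mean_loss - scale * math.sqrt(math.log(max(t, 2)) / n)

def select_small_lower_bound(mean_losses, counts, t, keep):
    """Return the indices of the `keep` samples with the smallest lower bounds."""
    scores = [lower_bound(m, n, t) for m, n in zip(mean_losses, counts)]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:keep])

# Four samples: the last has a large mean loss but was selected only once.
mean_losses = [0.2, 0.3, 0.9, 1.0]
counts = [20, 20, 20, 1]
selected = select_small_lower_bound(mean_losses, counts, t=20, keep=2)
print(selected)
```

Ranking by mean loss alone would keep samples 0 and 1. Ranking by the lower bound keeps sample 0 and gives sample 3 a try, which is the behavior the abstract describes for large-loss but less selected data.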

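Motivation and objectives

Motivate robust learning under label noise, where small-loss selection can be unreliable.
Incorporate the uncertainty of losses by using interval estimates instead of point estimates of losses.
Develop robust mean estimators (soft truncation and hard truncation) that aggregate losses over time.
Encourage the selection of underrepresented but potentially correctly labeled data in order to improve generalization.
Demonstrate effectiveness on synthetic balanced and imbalanced datasets and on real-world noisy data.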

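Proposed method

Model the training loss as a (Markov) process that evolves over iterations.
Extend the time interval and aggregate losses over multiple iterations to stabilize selection.
Introduce soft truncation, a robust mean estimator built on a log-based influence function.
Introduce hard truncation, which obtains a robust mean estimator through KNN-based outlier removal.
Derive concentration bounds for the soft and hard estimators to obtain a conservative selection criterion.
Use a two-network co-training framework in which each network selects the training subset for its peer (Algorithm 1, CNLCU).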
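The two robust mean estimators listed above can be illustrated on a single sample's loss history. The sketch below makes several assumptions: the influence function `psi` takes a Catoni-style logarithmic form, which this review does not spell out, and the hard-truncation rule (drop points whose k-th nearest neighbor is farther than `threshold`) is a simple stand-in for the KNN-based outlier removal. All function and parameter names are hypothetical.

```python
import math

def psi(x):
    """Log-based influence function (assumed Catoni-style form).

    Grows logarithmically, so very large losses contribute far less
    than they would to a plain average.
    """
    if x >= 0:
        return math.log(1.0 + x + 0.5 * x * x)
    return -math.log(1.0 - x + 0.5 * x * x)

def soft_truncated_mean(losses, lam=1.0):
    """Soft truncation: average the losses after passing them through psi."""
    return sum(psi(lam * v) for v in losses) / (lam * len(losses))

def kth_neighbor_distance(values, i, k):
    """Distance from values[i] to its k-th nearest neighbor (1-D)."""
    dists = sorted(abs(values[i] - v) for j, v in enumerate(values) if j != i)
    return dists[k - 1]

def hard_truncated_mean(losses, k=2, threshold=1.0):
    """Hard truncation: drop KNN outliers, then take a plain average."""
    if len(losses) <= k:
        return sum(losses) / len(losses)
    kept = [v for i, v in enumerate(losses)
            if kth_neighbor_distance(losses, i, k) <= threshold]
    # Fall back to the plain mean if every point was flagged.
    return sum(kept) / len(kept) if kept else sum(losses) / len(losses)

# Loss history of one sample with a single spike.
history = [0.5, 0.6, 0.4, 0.5, 8.0]
plain = sum(history) / len(history)
soft = soft_truncated_mean(history)
hard = hard_truncated_mean(history)
print(plain, soft, hard)
```

In this toy history the spike pulls the plain mean up to about 2.0. The soft estimator dampens it, and the hard estimator removes it entirely and returns the mean of the remaining four values.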
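The peer-selection step of the two-network framework can be shown without any actual networks. In this hedged sketch, `scores_a` and `scores_b` stand for whatever per-sample selection scores each network computes (for example, lower bounds of losses); the helper names are hypothetical and the parameter update itself is omitted.

```python
def smallest_indices(scores, keep):
    """Indices of the `keep` smallest scores, returned in ascending index order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:keep])

def exchange_selections(scores_a, scores_b, keep):
    """Return (indices network A trains on, indices network B trains on)."""
    picked_by_a = smallest_indices(scores_a, keep)
    picked_by_b = smallest_indices(scores_b, keep)
    # Cross update: each network trains on the subset chosen by its peer.
    return picked_by_b, picked_by_a

scores_a = [0.1, 0.9, 0.2, 0.8]  # scores computed by network A
scores_b = [0.7, 0.1, 0.2, 0.9]  # scores computed by network B
train_a, train_b = exchange_selections(scores_a, scores_b, keep=2)
print(train_a, train_b)
```

Here network A would be updated on samples 1 and 2 (chosen by B), and network B on samples 0 and 2 (chosen by A).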

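Experimental results

Research questions

RQ1: Can the uncertainty of losses be used to improve sample selection under noisy labels?
RQ2: Do robust mean estimators and conservative bounds improve robustness across noise types and class imbalance?
RQ3: How does CNLCU compare with existing sample selection methods on synthetic and real-world noisy datasets?
RQ4: Is it beneficial to explore large-loss data that have been selected less often in order to recover underrepresented clean examples?
RQ5: How do the soft and hard truncation strategies perform under different training intervals and noise regimes?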

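Key findings

CNLCU-S and CNLCU-H achieve superior or competitive accuracy on the MNIST, F-MNIST, and CIFAR datasets across multiple noise types and levels.
The proposed method is robust to imbalanced noisy data and to a broad range of noise types, outperforming several baselines in the main settings.
Soft and hard truncation improve the stability of loss-based sample selection through robust mean estimation and outlier removal.
CNLCU yields notable gains on imbalanced synthetic datasets, indicating better use of underrepresented classes.
On Clothing1M, the CNLCU variants outperform JoCor on both the Best and Last metrics, but they do not always match the strongest state-of-the-art backbones.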

This review was generated by AI and checked by a human editor.