QUICK REVIEW

[Paper Review] Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels

Curtis G. Northcutt, Tailin Wu|arXiv (Cornell University)|May 4, 2017

Machine Learning and Data Classification|References: 19|Citations: 57

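One-sentence summary

Rank Pruning is a time-efficient method for binary classification with noisy labels: it jointly estimates the asymmetric noise rates, removes mislabeled examples, and, under ideal conditions, achieves the same expected risk as training on clean data.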

ABSTRACT

Noisy PN learning is the problem of binary classification when training examples may be mislabeled (flipped) uniformly with noise rate rho1 for positive examples and rho0 for negative examples. We propose Rank Pruning (RP) to solve noisy PN learning and the open problem of estimating the noise rates, i.e. the fraction of wrong positive and negative labels. Unlike prior solutions, RP is time-efficient and general, requiring O(T) for any unrestricted choice of probabilistic classifier with T fitting time. We prove RP has consistent noise estimation and equivalent expected risk as learning with uncorrupted labels in ideal conditions, and derive closed-form solutions when conditions are non-ideal. RP achieves state-of-the-art noise estimation and F1, error, and AUC-PR for both MNIST and CIFAR datasets, regardless of the amount of noise and performs similarly impressively when a large portion of training examples are noise drawn from a third distribution. To highlight, RP with a CNN classifier can predict if an MNIST digit is a "one" or "not" with only 0.25% error, and 0.46% error across all digits, even when 50% of positive examples are mislabeled and 50% of observed positive labels are mislabeled negative examples.

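Motivation and objectives

Motivate and formalize the noisy PN (tilde-PN) learning problem, i.e. binary classification with asymmetric label noise, and show why the noise rates (rho1, rho0) must be estimated.
Introduce Rank Pruning as a two-step solution: (i) estimate the noise rates from confident examples; (ii) remove mislabeled examples and train on the confident subset.
Prove consistency of the noise estimates, derive closed-form results under non-ideal conditions, and show expected risk equivalent to clean-label learning under specific assumptions.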
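
To make the noise model concrete, here is a minimal simulation sketch. It assumes the usual reading of the setup: rho1 = P(s=0 | y=1) and rho0 = P(s=1 | y=0), where y is the true label and s the observed label, with flips independent of the features. The function names, the 30% positive prior, and the chosen rates are hypothetical and only serve the illustration.

```python
import random


def flip_labels(y, rho1, rho0, rng):
    """Return noisy labels s: each positive flips to 0 with probability rho1,
    each negative flips to 1 with probability rho0."""
    s = []
    for label in y:
        rate = rho1 if label == 1 else rho0
        s.append(1 - label if rng.random() < rate else label)
    return s


def empirical_rates(y, s):
    """Empirical noise rates (rho1, rho0) and inverse noise rates (pi1, pi0)."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    pos_to_neg = sum(1 for a, b in zip(y, s) if a == 1 and b == 0)
    neg_to_pos = sum(1 for a, b in zip(y, s) if a == 0 and b == 1)
    obs_pos = sum(s)
    obs_neg = len(s) - obs_pos
    return {
        "rho1": pos_to_neg / n_pos,   # P(s=0 | y=1)
        "rho0": neg_to_pos / n_neg,   # P(s=1 | y=0)
        "pi1": neg_to_pos / obs_pos,  # P(y=0 | s=1)
        "pi0": pos_to_neg / obs_neg,  # P(y=1 | s=0)
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
y = [1 if rng.random() < 0.3 else 0 for _ in range(50_000)]  # hypothetical 30% positives
s = flip_labels(y, rho1=0.4, rho0=0.1, rng=rng)
rates = empirical_rates(y, s)
```

With these hypothetical settings, Bayes' rule gives P(s=1) = 0.6 * 0.3 + 0.1 * 0.7 = 0.25 and pi1 = 0.1 * 0.7 / 0.25 = 0.28, so the empirical `rates["pi1"]` should land near 0.28. The distinction matters for reading the headline result: "50% of positive examples are mislabeled" is a statement about rho1, while "50% of observed positive labels are mislabeled negative examples" is a statement about pi1.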

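Proposed method

Define counts of confident examples and, from the predicted probabilities g(x), derive the estimators hat{rho}_1^{conf} and hat{rho}_0^{conf}.
Compute the threshold-based splits LB_{y=1} and UB_{y=0} to separate correctly labeled from mislabeled examples within each noisily labeled set.
Use BFPRT (linear-time selection) to prune hat{pi}_1|tilde{P}| examples from tilde{P} and hat{pi}_0|tilde{N}| examples from tilde{N} in O(n) time, forming the confident training set.
Reweight the loss on the pruned data to restore the positive/negative balance, then fit the final classifier.
Prove that, when range separability holds, Rank Pruning achieves the same expected risk as learning with uncorrupted labels (Theorem 5).
Show that fitting takes O(T) time and pruning O(n), for O(T) overall with typical classifiers.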
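
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes LB_{y=1} and UB_{y=0} are the mean of g over the observed positives and observed negatives respectively, that the inverse noise rates pi1 = P(y=0 | s=1) and pi0 = P(y=1 | s=0) follow from Bayes' rule, and that kept examples are reweighted by 1/(1 - rho) for their observed class. None of these details is spelled out in this review, so they should be checked against the paper. The scores g are taken as given (for instance, out-of-sample probabilities from any probabilistic classifier), all function names are hypothetical, and a plain sort stands in for BFPRT, so the pruning step here costs O(n log n) rather than O(n).

```python
def estimate_noise_rates(g, s):
    """Confident-count estimates of (rho1, rho0).

    g[i] approximates P(s=1 | x_i); s[i] is the observed (noisy) label.
    """
    pos = [p for p, lab in zip(g, s) if lab == 1]  # scores of observed positives
    neg = [p for p, lab in zip(g, s) if lab == 0]  # scores of observed negatives
    lb_y1 = sum(pos) / len(pos)  # assumed threshold: mean score over observed positives
    ub_y0 = sum(neg) / len(neg)  # assumed threshold: mean score over observed negatives
    p_y1 = sum(p >= lb_y1 for p in pos)  # labeled 1, confidently y=1
    n_y1 = sum(p >= lb_y1 for p in neg)  # labeled 0, confidently y=1
    p_y0 = sum(p <= ub_y0 for p in pos)  # labeled 1, confidently y=0
    n_y0 = sum(p <= ub_y0 for p in neg)  # labeled 0, confidently y=0
    return n_y1 / (n_y1 + p_y1), p_y0 / (p_y0 + n_y0)


def inverse_noise_rates(rho1, rho0, ps1):
    """pi1 = P(y=0 | s=1) and pi0 = P(y=1 | s=0) by Bayes' rule; ps1 = P(s=1)."""
    denom = 1.0 - rho1 - rho0  # assumes rho1 + rho0 < 1
    pi1 = rho0 * (1.0 - ps1 - rho1) / (ps1 * denom)
    pi0 = rho1 * (ps1 - rho0) / ((1.0 - ps1) * denom)
    return pi1, pi0


def rank_prune(g, s):
    """Return (kept indices, per-class loss weights) for the final fit."""
    rho1, rho0 = estimate_noise_rates(g, s)
    pos_idx = [i for i, lab in enumerate(s) if lab == 1]
    neg_idx = [i for i, lab in enumerate(s) if lab == 0]
    pi1, pi0 = inverse_noise_rates(rho1, rho0, len(pos_idx) / len(s))
    k1 = round(min(max(pi1, 0.0), 1.0) * len(pos_idx))  # observed positives to drop
    k0 = round(min(max(pi0, 0.0), 1.0) * len(neg_idx))  # observed negatives to drop
    # Sorting is O(n log n); it stands in for linear-time selection (BFPRT).
    drop = set(sorted(pos_idx, key=lambda i: g[i])[:k1])  # lowest-scored positives
    drop |= set(sorted(neg_idx, key=lambda i: g[i], reverse=True)[:k0])  # highest-scored negatives
    kept = [i for i in range(len(s)) if i not in drop]
    weights = {1: 1.0 / (1.0 - rho1), 0: 1.0 / (1.0 - rho0)}  # assumed reweighting
    return kept, weights


# Toy data: 10 true positives (4 flipped to 0) and 20 true negatives (2 flipped to 1),
# scored by an idealized g that outputs 0.6 on positives and 0.1 on negatives.
y_true = [1] * 10 + [0] * 20
s_obs = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 18
g_hat = [0.6 if y == 1 else 0.1 for y in y_true]
kept, weights = rank_prune(g_hat, s_obs)
```

On this toy input the idealized scores separate the classes perfectly, so the pruned set contains only correctly labeled examples. With a real, imperfect classifier the estimates and the pruning would both be approximate.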

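Experimental results

Research questions

RQ1: Can Rank Pruning accurately estimate the asymmetric noise rates rho1 and rho0 from corrupted data?
RQ2: Under ideal conditions, does pruning down to the confident examples yield a classifier with the same expected risk as training on clean labels?
RQ3: How does Rank Pruning perform on standard datasets under non-ideal conditions (an imperfect g, overlap between P and N, added noise from a third distribution, and so on)?
RQ4: Does Rank Pruning remain time-efficient for large datasets and complex models?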

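Key findings

Rank Pruning estimates noise rates robustly and achieves state-of-the-art F1, error, and AUC-PR on MNIST and CIFAR across varying noise levels and noise distributions.
With a CNN, Rank Pruning reaches 0.25% error on MNIST one-vs-not-one and 0.46% error across all digits, even when 50% of positive examples are mislabeled and 50% of observed positive labels are mislabeled negatives.
Under ideal conditions, hat{rho}_1^{conf} = rho1 and hat{rho}_0^{conf} = rho0 (consistency).
Under non-ideal conditions, hat{rho}_1^{conf} and hat{rho}_0^{conf} remain upper bounds on the true noise rates, and Rank Pruning is robust to estimation error when certain threshold conditions are met (Theorem 4).
When range separability holds and the noise rates are estimated exactly, Rank Pruning yields the same expected risk as learning from clean labels (Theorem 5).
The algorithm needs O(T) time to fit the base classifier and O(n) to prune, which makes it practical for large problems.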
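
The consistency claim for the ideal case can be traced in a few lines. The derivation below is a sketch under assumptions that this review does not state explicitly: the ideal classifier is g(x) = P(s=1 | x), the classes P and N do not overlap, rho1 + rho0 < 1, pi1 = P(y=0 | s=1) < 1, and LB_{y=1} is the mean of g over the observed positives tilde{P}. The argument for hat{rho}_0^{conf} is symmetric, using UB_{y=0}.

```latex
% Ideal scores take only two values:
g(x) = P(s = 1 \mid x) =
\begin{cases}
  1 - \rho_1, & x \in P \\
  \rho_0,     & x \in N
\end{cases}

% The assumed threshold is a mixture of those two values:
LB_{y=1} = E_{x \in \tilde{P}}[g(x)] = (1 - \pi_1)(1 - \rho_1) + \pi_1 \rho_0,
\qquad \rho_0 < LB_{y=1} \le 1 - \rho_1 .

% So thresholding at LB_{y=1} selects exactly the true positives:
\{ x : g(x) \ge LB_{y=1} \} = P

% and the confident-count ratio becomes the fraction of true positives labeled negative:
\hat{\rho}_1^{conf}
  = \frac{|\tilde{N} \cap P|}{|\tilde{N} \cap P| + |\tilde{P} \cap P|}
  = \frac{|\tilde{N} \cap P|}{|P|}
  \longrightarrow \rho_1 .
```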

This review was generated by AI and checked by a human editor.