QUICK REVIEW

[Paper Digest] Learning with Bounded Instance- and Label-dependent Label Noise

Jiacheng Cheng, Tongliang Liu|arXiv (Cornell University)|Sep 12, 2017

Machine Learning and Data Classification | References: 63 | Cited by: 28

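One-Sentence Summary

This paper proposes a learning algorithm for Bounded Instance- and Label-dependent Label Noise (BILN), in which the noise rates vary with both the instance and the label but are bounded from above. By introducing 'distilled examples' (data points whose labels agree with the predictions of the Bayes optimal classifier), the method achieves statistical consistency and robustness. Empirically, it outperforms the baselines on both synthetic and real-world datasets under a range of noise conditions.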

ABSTRACT

Instance- and Label-dependent label Noise (ILN) widely exists in real-world datasets but has been rarely studied. In this paper, we focus on Bounded Instance- and Label-dependent label Noise (BILN), a particular case of ILN where the label noise rates -- the probabilities that the true labels of examples flip into the corrupted ones -- have upper bound less than $1$. Specifically, we introduce the concept of distilled examples, i.e. examples whose labels are identical with the labels assigned for them by the Bayes optimal classifier, and prove that under certain conditions classifiers learnt on distilled examples will converge to the Bayes optimal classifier. Inspired by the idea of learning with distilled examples, we then propose a learning algorithm with theoretical guarantees for its robustness to BILN. At last, empirical evaluations on both synthetic and real-world datasets show effectiveness of our algorithm in learning with BILN.
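
The noise model described in the abstract can be written out explicitly. The following is a sketch in standard binary-classification notation; the symbols $\eta$, $\tilde{Y}$, and $\rho_{\pm 1}(x)$ are assumptions chosen to match the $\rho_{\pm 1\text{max}}$ notation used elsewhere in this digest, and are not quoted from the paper.

```latex
% Labels Y \in \{-1, +1\}; \tilde{Y} is the observed (possibly corrupted) label.
% Instance- and label-dependent flip rates:
\rho_{+1}(x) = P(\tilde{Y} = -1 \mid Y = +1, X = x), \qquad
\rho_{-1}(x) = P(\tilde{Y} = +1 \mid Y = -1, X = x)

% BILN: the rates are bounded above, uniformly over instances.
0 \le \rho_{+1}(x) \le \rho_{+1\text{max}} < 1, \qquad
0 \le \rho_{-1}(x) \le \rho_{-1\text{max}} < 1 \qquad \text{for all } x

% Distilled example: the label agrees with the Bayes optimal classifier.
\eta(x) = P(Y = +1 \mid X = x), \qquad
(x, y) \text{ is distilled} \iff y = \operatorname{sign}\bigl(\eta(x) - \tfrac{1}{2}\bigr)
\quad (\eta(x) \ne \tfrac{1}{2})
```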

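Research Motivation and Objectives

Address the lack of theory and algorithms for Bounded Instance- and Label-dependent Label Noise (BILN), a more realistic but under-studied form of label noise.
Establish theoretical guarantees of robustness to BILN, including statistical consistency and performance bounds.
Develop a practical learning algorithm that uses distilled examples to converge to the Bayes optimal classifier under BILN.
Evaluate the algorithm empirically on synthetic and real-world datasets, demonstrating its effectiveness across noise rates and without prior knowledge of the noise bounds.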

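Proposed Method

Introduce the concept of 'distilled examples', i.e. examples whose labels agree with the predictions of the Bayes optimal classifier, under the assumption that such examples exist and can be identified.
Propose a learning algorithm trained on distilled examples that converges to the Bayes optimal classifier under BILN.
Identify distilled examples without prior knowledge of the noise bounds by using a hyperparameter $ k $ to select the examples with the highest activations from the noisy model.
Use an algorithm variant that estimates the noise rates through iterative optimization and active selection of high-confidence predictions.
Prove statistical consistency under BILN through theoretical analysis and derive generalization error bounds.
Use a noise-rate estimation strategy based on anchor points and confidence thresholds to identify and filter out mislabeled examples.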
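
The selection of distilled examples can be illustrated with a short sketch. This digest does not state the paper's exact selection rule, so the thresholds below are an assumption: they follow from writing the noisy posterior as $\tilde{\eta}(x) = (1-\rho_{+1}(x))\,\eta(x) + \rho_{-1}(x)\,(1-\eta(x))$ and additionally assuming $\rho_{+1}(x) + \rho_{-1}(x) \le 1$, in which case $\tilde{\eta}(x) < (1-\rho_{+1\text{max}})/2$ implies $\eta(x) < 1/2$ and $\tilde{\eta}(x) > (1+\rho_{-1\text{max}})/2$ implies $\eta(x) > 1/2$. The function names and the top-$k$ reading of the hyperparameter $ k $ are hypothetical, and `noisy_posterior` is assumed to be an estimate of $P(\tilde{Y}=+1 \mid x)$ produced by any probabilistic classifier trained on the noisy labels.

```python
import numpy as np


def distill_with_bounds(noisy_posterior, rho_pos_max, rho_neg_max):
    """Pick examples whose Bayes label is implied by the noisy posterior.

    Assumed rule (not quoted from the paper):
      p < (1 - rho_pos_max) / 2  ->  Bayes label -1
      p > (1 + rho_neg_max) / 2  ->  Bayes label +1
    Returns the indices of the selected examples and their assigned labels.
    """
    p = np.asarray(noisy_posterior, dtype=float)
    is_neg = p < (1.0 - rho_pos_max) / 2.0
    is_pos = p > (1.0 + rho_neg_max) / 2.0
    idx = np.flatnonzero(is_neg | is_pos)
    labels = np.where(is_pos[idx], 1, -1)
    return idx, labels


def distill_top_k(noisy_posterior, k):
    """Hypothetical bound-free variant: keep the k most confident examples.

    Confidence is measured as the distance of the noisy posterior from 0.5;
    each kept example is labelled by the side of 0.5 it falls on.
    """
    p = np.asarray(noisy_posterior, dtype=float)
    confidence = np.abs(p - 0.5)
    idx = np.argsort(-confidence, kind="stable")[:k]
    labels = np.where(p[idx] > 0.5, 1, -1)
    return idx, labels
```

A classifier would then be retrained on `(X[idx], labels)` only. Note that larger noise bounds shrink the set selected by `distill_with_bounds`, since the two thresholds move towards 0 and 1.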

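Experimental Results

Research Questions

RQ1: Can a learning algorithm achieve statistical consistency when the training data are corrupted by Bounded Instance- and Label-dependent Label Noise (BILN)?
RQ2: When the noise rates are unknown, how can distilled examples be identified and used to improve robustness to BILN?
RQ3: How does the proposed algorithm compare with existing methods under different levels of instance- and label-dependent noise?
RQ4: How sensitive is the algorithm to the choice of the hyperparameter $ k $ when the noise bounds are unknown?
RQ5: Does the proposed method generalize to real-world datasets with complex, non-uniform noise patterns?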

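Key Findings

On the synthetic dataset with noise rates (0.49, 0.49), the proposed algorithm reaches 99.23% accuracy, substantially outperforming baselines such as peer loss (89.10%) and noisy+act (92.36%).
On the UCI Image dataset with noise rates (0.5, 0.5), the algorithm reaches 74.51% accuracy, ahead of peer loss (64.61%) and noisy+act (69.45%).
On the USPS (6vs8) dataset with noise rates (0.5, 0.5), the algorithm reaches 83.40% accuracy, above peer loss (82.52%) and noisy+act (77.95%).
The variant that requires no prior knowledge of $ \rho_{+1\text{max}} $ and $ \rho_{-1\text{max}} $, 'Algo. 1 w/o $ \rho_{\pm 1\text{max}} $', matches or exceeds the performance of the version with known noise bounds.
The algorithm is robust to the hyperparameter $ k $: the performance curves in Figure 2 show that accuracy stays stable across values of $ k $ on all three datasets.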
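
The findings above are reported for noise-rate pairs such as (0.49, 0.49). This digest does not say how the paper generates its noisy labels, so the following is only a sketch of one way to inject bounded instance- and label-dependent flips for experiments of this kind. The function name, the weight vector `w`, and the sigmoid-shaped dependence of the flip rate on the instance are all hypothetical choices made for illustration.

```python
import numpy as np


def inject_biln(X, y, rho_pos_max, rho_neg_max, w, rng):
    """Flip labels in {-1, +1} with instance-dependent, bounded rates.

    Hypothetical noise model: a sigmoid score of the instance scales the
    flip rate, so rho_{+1}(x) lies in [0, rho_pos_max] and rho_{-1}(x)
    lies in [0, rho_neg_max] for every x.
    Returns the noisy labels and the per-example flip rates that were used.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    score = 1.0 / (1.0 + np.exp(-(X @ w)))   # in (0, 1), depends on x only
    rho_pos = rho_pos_max * score            # flip rate if the true label is +1
    rho_neg = rho_neg_max * (1.0 - score)    # flip rate if the true label is -1
    rate = np.where(y == 1, rho_pos, rho_neg)
    flip = rng.random(len(y)) < rate         # one Bernoulli draw per example
    y_noisy = np.where(flip, -y, y)
    return y_noisy, rate
```

Because the returned rates never exceed the supplied maxima, the corrupted dataset satisfies the BILN bound by construction, which makes it convenient for checking how a selection rule behaves as the bounds approach 0.5.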

This digest was generated by AI and reviewed by human editors.