QUICK REVIEW

[Paper Walkthrough] Sample Selection with Uncertainty of Losses for Learning with Noisy Labels

Xiaobo Xia, Tongliang Liu|arXiv (Cornell University)|Jun 1, 2021

Machine Learning and Data Classification|References: 69|Cited by: 48

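One-sentence summary

This paper proposes CNLCU, a sample selection method that uses interval estimates of loss uncertainty to train robustly with noisy labels, improving robustness on balanced and imbalanced data and under real-world noise.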

ABSTRACT

In learning with noisy labels, the sample selection approach is very popular, which regards small-loss data as correctly labeled during training. However, losses are generated on-the-fly based on the model being trained with noisy labels, and thus large-loss data are likely but not certainly to be incorrect. There are actually two possibilities of a large-loss data point: (a) it is mislabeled, and then its loss decreases slower than other data, since deep neural networks "learn patterns first"; (b) it belongs to an underrepresented group of data and has not been selected yet. In this paper, we incorporate the uncertainty of losses by adopting interval estimation instead of point estimation of losses, where lower bounds of the confidence intervals of losses derived from distribution-free concentration inequalities, but not losses themselves, are used for sample selection. In this way, we also give large-loss but less selected data a try; then, we can better distinguish between the cases (a) and (b) by seeing if the losses effectively decrease with the uncertainty after the try. As a result, we can better explore underrepresented data that are correctly labeled but seem to be mislabeled at first glance. Experiments demonstrate that the proposed method is superior to baselines and robust to a broad range of label noise types.

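Research motivation and goals

Make learning under label noise more robust, since small-loss selection can misjudge large-loss samples that are in fact correctly labeled.
Capture the uncertainty of losses with interval estimation rather than point estimation.
Develop robust mean estimators (soft truncation and hard truncation) that aggregate losses over time.
Encourage the selection of rarely selected but potentially correctly labeled data to improve generalization.
Demonstrate effectiveness on synthetic balanced and imbalanced data and on real-world noisy datasets.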
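
The goals above turn on one idea: rank samples by a lower bound of the loss rather than by the loss itself, so that rarely selected samples get a try. The sketch below illustrates this with a generic Hoeffding-style interval width. The function name `select_by_lower_bound`, the `scale` parameter, and the width formula are assumptions made for illustration; they are not the bound derived in the paper.

```python
import numpy as np

def select_by_lower_bound(mean_loss, times_selected, step, keep_ratio, scale=1.0):
    """Return indices of the samples with the smallest loss lower bounds.

    mean_loss      : per-sample average of past losses (the point estimate)
    times_selected : how many times each sample has been selected so far
    step           : current training step (>= 1)
    keep_ratio     : fraction of samples to keep
    scale          : hypothetical knob for the interval width
    """
    mean_loss = np.asarray(mean_loss, dtype=float)
    # Treat never-selected samples as selected once to avoid division by zero.
    n = np.maximum(np.asarray(times_selected, dtype=float), 1.0)
    # Generic Hoeffding-style half-width (an assumption, NOT the paper's bound):
    # the fewer times a sample was selected, the wider its interval.
    half_width = scale * np.sqrt(np.log(step + 1.0) / n)
    lower = mean_loss - half_width
    k = max(1, int(round(keep_ratio * len(mean_loss))))
    return np.argsort(lower, kind="stable")[:k]

# Toy data: sample 2 has a large average loss but was selected only once.
mean_loss = [0.2, 0.9, 1.0, 0.3]
times_selected = [50, 50, 1, 50]

# Point-estimate ranking (plain small-loss selection) for comparison.
point = np.argsort(mean_loss, kind="stable")[:3]
# Lower-bound ranking.
lcb = select_by_lower_bound(mean_loss, times_selected, step=100, keep_ratio=0.75)
```

In this toy input, sample 2 has the largest average loss but a wide interval because it has been selected only once. Ranking by the lower bound keeps it, while ranking by the average loss drops it. Whether its loss then falls after being tried is what separates a mislabeled sample from an underrepresented one.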

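Proposed method

Model the training loss as a process that evolves over iterations (a Markov process).
Extend the time interval and aggregate losses over multiple iterations to stabilize selection.
Introduce soft truncation, a robust mean estimator that uses a logarithmic influence function.
Introduce hard truncation, which removes outliers with a KNN-based detector before robust mean estimation.
Derive concentration bounds for the soft and hard estimators to obtain a conservative selection criterion.
Use a two-network co-training framework in which each network selects a subset of samples for its peer network to train on (Algorithm 1, CNLCU).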
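
The two robust mean estimators listed above can be sketched for a single sample's loss history as follows. This is a minimal illustration under stated assumptions: the influence function log(1 + x + x²/2) is a Catoni-style choice assumed here for non-negative losses, and the outlier rule (drop the `n_outliers` points with the largest distance to their k-th nearest neighbour) is a simplified stand-in for a KNN-based detector. The function names and the `k` and `n_outliers` parameters are hypothetical.

```python
import numpy as np

def soft_truncated_mean(losses):
    """Average of log(1 + x + x^2/2), so large losses grow only logarithmically."""
    x = np.asarray(losses, dtype=float)
    if np.any(x < 0):
        raise ValueError("this sketch assumes non-negative losses")
    # log1p(x + x^2/2) equals log(1 + x + x^2/2).
    return float(np.mean(np.log1p(x + 0.5 * x**2)))

def hard_truncated_mean(losses, k=2, n_outliers=1):
    """Mean of the losses after dropping the points with the largest kNN distance."""
    x = np.asarray(losses, dtype=float)
    if not (1 <= k < len(x)):
        raise ValueError("k must satisfy 1 <= k < number of losses")
    if not (0 <= n_outliers < len(x)):
        raise ValueError("n_outliers must leave at least one loss")
    # Pairwise absolute distances between the 1-D loss values.
    dist = np.abs(x[:, None] - x[None, :])
    # Each row contains a 0 for the point itself, so after sorting,
    # column k holds the distance to the k-th nearest other point.
    knn_score = np.sort(dist, axis=1)[:, k]
    # Keep the points with the smallest scores; the rest count as outliers.
    keep = np.argsort(knn_score, kind="stable")[: len(x) - n_outliers]
    return float(np.mean(x[keep]))

# Loss history of one sample with a single spike.
history = [0.5, 0.6, 0.4, 0.5, 8.0]
plain = float(np.mean(history))
soft = soft_truncated_mean(history)
hard = hard_truncated_mean(history, k=2, n_outliers=1)
```

On this history the spike dominates the plain average, is damped by the soft estimator, and is discarded entirely by the hard estimator. In a selection method of this kind, such an estimate would then be combined with a concentration bound to form the lower bound used for ranking.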

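Experimental results

Research questions

RQ1: Can the uncertainty of losses be used to improve sample selection under noisy labels?
RQ2: Do robust mean estimators and conservative bounds improve robustness to multiple noise types and to class imbalance?
RQ3: How does CNLCU compare with existing sample selection methods on synthetic and real-world noisy datasets?
RQ4: Does exploring large-loss but rarely selected data help recover underrepresented clean samples?
RQ5: How do soft truncation and hard truncation perform across different time intervals and noise scenarios?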

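Main findings

CNLCU-S and CNLCU-H achieve superior or competitive accuracy on the MNIST, F-MNIST, and CIFAR datasets across multiple noise types and levels.
The proposed method is robust to imbalanced noisy data and to a broad range of noise types, outperforming several baselines in key settings.
Soft and hard truncation improve the stability of loss-based sample selection through robust mean estimation and outlier removal.
CNLCU achieves notable gains on imbalanced synthetic datasets, indicating better use of underrepresented classes.
On Clothing1M, the CNLCU variants outperform JoCor on both the Best and Last metrics, although they do not always match the strongest state-of-the-art backbones.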

This walkthrough was generated by AI and reviewed by a human editor.