QUICK REVIEW

[Paper Review] Rethinking the Value of Labels for Improving Class-Imbalanced Learning

Yuzhe Yang, Zhi Xu | arXiv (Cornell University) | Jun 13, 2020

Imbalanced Data Classification Techniques | References: 56 | Cited by: 212

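One-Sentence Summary

This paper analyzes the value of imbalanced labels in class-imbalanced learning and shows, through theory and large-scale experiments, that both semi-supervised and self-supervised approaches improve performance.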

ABSTRACT

Real-world data often exhibits long-tailed distributions with heavy class imbalance, posing great challenges for deep recognition models. We identify a persisting dilemma on the value of labels in the context of imbalanced learning: on the one hand, supervision from labels typically leads to better results than its unsupervised counterparts; on the other hand, heavily imbalanced data naturally incurs "label bias" in the classifier, where the decision boundary can be drastically altered by the majority classes. In this work, we systematically investigate these two facets of labels. We demonstrate, theoretically and empirically, that class-imbalanced learning can significantly benefit in both semi-supervised and self-supervised manners. Specifically, we confirm that (1) positively, imbalanced labels are valuable: given more unlabeled data, the original labels can be leveraged with the extra data to reduce label bias in a semi-supervised manner, which greatly improves the final classifier; (2) negatively however, we argue that imbalanced labels are not useful always: classifiers that are first pre-trained in a self-supervised manner consistently outperform their corresponding baselines. Extensive experiments on large-scale imbalanced datasets verify our theoretically grounded strategies, showing superior performance over previous state-of-the-arts. Our intriguing findings highlight the need to rethink the usage of imbalanced labels in realistic long-tailed tasks. Code is available at https://github.com/YyzHarry/imbalanced-semi-self.

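Motivation and Objectives

Understand how label information behaves under extreme class imbalance in real-world data.
Analyze theoretically both the positive and the negative aspects of imbalanced labels.
Propose semi-supervised and self-supervised strategies that make better use of imbalanced labels to improve performance on long-tailed tasks.
Validate the theory with large-scale experiments on CIFAR-10/100-LT, SVHN-LT, ImageNet-LT, and iNaturalist 2018.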
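To make the notion of an imbalance ratio concrete, the sketch below generates per-class sample counts for a long-tailed split. It assumes an exponentially decaying profile, a common convention for "-LT" benchmarks that this summary does not itself specify; `long_tailed_counts` and its parameters are hypothetical names used only for illustration.

```python
def long_tailed_counts(n_max, num_classes, imbalance_ratio):
    """Per-class sample counts with an exponential long-tail profile.

    Assumption (not from the summary): class i keeps
    n_max * imbalance_ratio ** (-i / (num_classes - 1)) samples, so the
    largest class has n_max samples and the smallest has roughly
    n_max / imbalance_ratio.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    counts = []
    for i in range(num_classes):
        frac = i / (num_classes - 1)
        counts.append(round(n_max * imbalance_ratio ** (-frac)))
    return counts


# Example: 10 classes, head class of 5000 images, imbalance ratio 100.
counts = long_tailed_counts(5000, 10, 100)
head, tail = counts[0], counts[-1]
```

Here the imbalance ratio is taken to be the size of the largest class divided by the size of the smallest one, so a ratio of 100 leaves the rarest class with about 1% of the head class's samples.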

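Proposed Method

Model imbalanced learning theoretically with a Gaussian mixture model to study the effect of unlabeled data and pseudo-labels on top of imbalanced labels.
Propose a semi-supervised framework that uses pseudo-labels on unlabeled data to mitigate label bias.
Propose a self-supervised pre-training (SSP) stage that initializes the model without using labels before standard training.
Evaluate empirically on long-tailed benchmarks with different imbalance ratios, comparing semi-supervised learning (SSL) and SSP.
Use t-SNE visualizations to illustrate the improved decision-boundary shape and class separation.
Demonstrate that SSL and SSP are compatible with existing imbalanced-learning techniques.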
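The pseudo-labeling pipeline described above can be sketched on toy one-dimensional data. This is a minimal illustration under stated assumptions, not the authors' implementation: a nearest-centroid classifier stands in for the deep model, and `fit_centroids`, `predict`, and `pseudo_label_round` are hypothetical helper names.

```python
from statistics import mean


def fit_centroids(xs, ys):
    """Nearest-centroid 'classifier': one mean per observed class."""
    return {c: mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))


def pseudo_label_round(xs, ys, unlabeled):
    """One round of pseudo-labeling.

    1. Fit a base model on the (imbalanced) labeled data.
    2. Use it to pseudo-label the unlabeled pool.
    3. Refit on labeled plus pseudo-labeled data.
    """
    base = fit_centroids(xs, ys)
    pseudo = [predict(base, u) for u in unlabeled]
    final = fit_centroids(xs + unlabeled, ys + pseudo)
    return base, pseudo, final


# Imbalanced labels: five samples of class 0, a single sample of class 1.
xs = [-1.2, -1.0, -0.8, -1.1, -0.9, 1.6]
ys = [0, 0, 0, 0, 0, 1]
unlabeled = [-1.0, -0.9, 0.9, 1.0, 1.1, 0.8]

base, pseudo, final = pseudo_label_round(xs, ys, unlabeled)
```

In this toy run the tail class's centroid is estimated from one labeled point before pseudo-labeling and from five points afterward, which is the intuition behind using extra unlabeled data to reduce the bias introduced by imbalanced labels.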

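Experimental Results

Research Questions

RQ1: Can unlabeled data with pseudo-labels reduce label bias and improve imbalanced learning in a semi-supervised setting?
RQ2: Does semi-supervised learning provide stable gains across different imbalance ratios and datasets?
RQ3: Can self-supervised pre-training (SSP), which uses no labeled data, provide robust improvements for imbalanced learning?
RQ4: How do the properties of the unlabeled data (its size and its own imbalance) affect semi-supervised gains on long-tailed tasks?
RQ5: Are the gains from SSP consistent across small-scale and large-scale imbalanced benchmarks?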

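Key Findings

In imbalanced settings, unlabeled data with pseudo-labels yields significant gains over the supervised baseline, reaching roughly 10 percentage points under extreme imbalance.
More balanced and larger pools of unlabeled data generally bring larger SSL gains, although the effect depends on how imbalanced the original data is.
Self-supervised pre-training (SSP) consistently improves performance across multiple baselines and datasets, often matching or even surpassing SSL methods that use labeled data.
In high-dimensional settings, even when the training data is imbalanced, SSP yields a roughly exponential improvement by learning more label-agnostic representations.
On CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, and iNaturalist 2018, SSP achieves new state-of-the-art results in several configurations.
Qualitative analysis (t-SNE) shows that SSP and SSL lead to clearer separation of tail classes and more robust decision boundaries.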
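Self-supervised pre-training needs a training signal that does not depend on class labels. The sketch below builds such a signal with rotation prediction, one common pretext task chosen here purely for illustration; the summary does not state which pretext task was used, and `rotate90` and `rotation_pretext` are hypothetical names.

```python
def rotate90(img):
    """Rotate a 2-D list-of-lists image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def rotation_pretext(images):
    """Build (rotated_image, rotation_index) pairs without class labels.

    Each image yields four samples, rotated by 0, 90, 180, and 270
    degrees; the target is the rotation index 0..3, so the label
    distribution of this pretext task is balanced by construction.
    """
    samples = []
    for img in images:
        rotated = [list(row) for row in img]  # copy, leave input untouched
        for k in range(4):
            samples.append((rotated, k))
            rotated = rotate90(rotated)
    return samples


# A single 2x2 "image" produces four pretext samples.
samples = rotation_pretext([[[1, 2], [3, 4]]])
```

A network would first be trained to predict the rotation index from the rotated image, and its weights would then initialize the standard supervised training on the imbalanced labels.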


This review was generated by AI and reviewed by human editors.