QUICK REVIEW

[Paper Review] Self-supervised Learning is More Robust to Dataset Imbalance

Hong Liu, Jeff Z. HaoChen|arXiv (Cornell University)|Oct 11, 2021

Domain Adaptation and Few-Shot Learning | References: 68 | Cited by: 63

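One-sentence summary

The paper shows that self-supervised learning (SSL) representations are more robust to class imbalance than supervised representations, offers theoretical and empirical explanations, and introduces a re-weighted regularization technique that further improves SSL performance on imbalanced data.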

ABSTRACT

Self-supervised learning (SSL) is a scalable way to learn general visual representations since it learns without labels. However, large-scale unlabeled datasets in the wild often have long-tailed label distributions, where we know little about the behavior of SSL. In this work, we systematically investigate self-supervised learning under dataset imbalance. First, we find out via extensive experiments that off-the-shelf self-supervised representations are already more robust to class imbalance than supervised representations. The performance gap between balanced and imbalanced pre-training with SSL is significantly smaller than the gap with supervised learning, across sample sizes, for both in-domain and, especially, out-of-domain evaluation. Second, towards understanding the robustness of SSL, we hypothesize that SSL learns richer features from frequent data: it may learn label-irrelevant-but-transferable features that help classify the rare classes and downstream tasks. In contrast, supervised learning has no incentive to learn features irrelevant to the labels from frequent examples. We validate this hypothesis with semi-synthetic experiments and theoretical analyses on a simplified setting. Third, inspired by the theoretical insights, we devise a re-weighted regularization technique that consistently improves the SSL representation quality on imbalanced datasets with several evaluation criteria, closing the small gap between balanced and imbalanced datasets with the same number of examples.

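Motivation and objectives

Study how class imbalance during pre-training affects self-supervised representations.
Compare the robustness of SSL and supervised pre-training under in-domain (ID) and out-of-domain (OOD) evaluation.
Provide theoretical and empirical explanations for the robustness of SSL under imbalance.
Propose a re-weighted regularization technique that improves SSL performance on imbalanced data.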

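Proposed method

Systematically evaluate SSL (MoCo v2 and SimSiam) and supervised pre-training on CIFAR-10 and ImageNet across different imbalance ratios and sample sizes.
Assess representation quality by linear probing on balanced in-domain data and by fine-tuning on downstream out-of-domain data.
Provide a toy theoretical setting that contrasts the features learned by SSL and by supervised learning under imbalance.
Run semi-synthetic experiments to visualize how the transferable features and label-relevant features learned by SSL differ from those learned by supervised learning (SL).
Introduce re-weighted sharpness-aware minimization (rwSAM), with weights derived from kernel density estimation, to improve SSL representation quality on imbalanced data.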
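
As a rough illustration of what "different imbalance ratios and sample sizes" can mean in practice, the sketch below builds a long-tailed subset of a labeled dataset. It rests on assumptions that this summary does not confirm: an exponentially decaying class profile, and an imbalance ratio defined as the largest class size divided by the smallest. The function names `long_tailed_counts` and `subsample_indices` are hypothetical and are not taken from the paper's code.

```python
import random
from collections import defaultdict


def long_tailed_counts(n_max, num_classes, ratio):
    """Per-class sample counts decaying exponentially from n_max to n_max / ratio.

    Assumption: `ratio` is (largest class size) / (smallest class size).
    """
    if num_classes < 2:
        return [n_max]
    counts = []
    for k in range(num_classes):
        # Class 0 keeps n_max samples; the last class keeps about n_max / ratio.
        frac = k / (num_classes - 1)
        counts.append(max(1, round(n_max * ratio ** (-frac))))
    return counts


def subsample_indices(labels, counts, seed=0):
    """Pick counts[y] random example indices for each class y (labels are 0..C-1)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    keep = []
    for y, n in enumerate(counts):
        pool = by_class[y]
        # Never request more examples than the class actually has.
        keep.extend(rng.sample(pool, min(n, len(pool))))
    return sorted(keep)


if __name__ == "__main__":
    # Toy labels: 3 classes with 100 examples each.
    labels = [c for c in range(3) for _ in range(100)]
    counts = long_tailed_counts(n_max=100, num_classes=3, ratio=10)
    subset = subsample_indices(labels, counts)
    print(counts, len(subset))
```

A balanced comparison subset with the same total number of examples can be drawn by passing equal counts to `subsample_indices`, which mirrors the "same number of examples" comparison mentioned in the abstract.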

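Experimental results

Research questions

RQ1: How does dataset imbalance affect the quality of SSL representations relative to supervised representations in ID and OOD evaluation?
RQ2: Why does SSL tend to learn more transferable features from frequent classes, and how do these features help rare classes?
RQ3: Can re-weighted regularization improve SSL performance on imbalanced datasets, and how does it affect generalization on rare examples?
RQ4: Do the theoretical toy setting and the semi-synthetic experiments support the claim that SSL captures label-irrelevant but transferable features?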

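Key findings

Across many configurations, SSL representations are more robust to class imbalance than supervised representations.
In both ID and OOD evaluation, the performance gap between balanced and imbalanced pre-training is smaller for SSL than for supervised learning (SL).
SSL tends to learn richer, more transferable features from frequent classes, which help with rare classes and downstream tasks.
Re-weighted sharpness-aware minimization (rwSAM) consistently improves SSL representation quality on imbalanced datasets and narrows the gap to balanced data.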
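
To make the re-weighting idea concrete, here is a minimal sketch of how kernel density estimates over feature vectors could be turned into per-example weights that emphasize rare, low-density examples. It is an illustration under stated assumptions, not the paper's rwSAM implementation: the Gaussian kernel, the `bandwidth` and `power` parameters, the mean-one normalization, and the function names are all hypothetical choices, and the sharpness-aware update itself is not shown.

```python
import math


def gaussian_kde_density(points, bandwidth=1.0):
    """Gaussian kernel density estimate evaluated at each point in `points`."""
    n = len(points)
    dim = len(points[0])
    # Normalizing constant of an isotropic Gaussian kernel, times n.
    norm = n * (bandwidth * math.sqrt(2.0 * math.pi)) ** dim
    densities = []
    for x in points:
        total = 0.0
        for z in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        densities.append(total / norm)
    return densities


def inverse_density_weights(points, bandwidth=1.0, power=1.0):
    """Weights proportional to density**(-power), rescaled to have mean 1.

    Assumption: examples in sparse regions of feature space (likely rare
    classes) should receive larger regularization weights.
    """
    densities = gaussian_kde_density(points, bandwidth)
    raw = [d ** (-power) for d in densities]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


if __name__ == "__main__":
    # Three clustered points (a "frequent" region) and one isolated point.
    features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
    weights = inverse_density_weights(features, bandwidth=0.5)
    print(weights)
```

In this toy example the isolated point receives the largest weight. In a training loop, such weights would scale each example's contribution to the regularization term; how exactly they enter the sharpness-aware objective is specified in the paper, not here.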


This review was generated by AI and checked by human editors.