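[Paper Digest] Class-Balanced Loss Based on Effective Number of Samples
This paper introduces a class-balanced loss based on the effective number of samples to handle long-tailed data. It ties each class's loss weight to the inverse of that class's effective number of samples and reports gains on CIFAR, iNaturalist, and ImageNet.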
With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while most classes are under-represented). Existing solutions typically adopt class re-balancing strategies such as re-sampling and re-weighting based on the number of observations for each class. In this work, we argue that as the number of samples increases, the additional benefit of a newly added data point will diminish. We introduce a novel theoretical framework to measure data overlap by associating with each sample a small neighboring region rather than a single point. The effective number of samples is defined as the volume of samples and can be calculated by a simple formula $(1-β^{n})/(1-β)$, where $n$ is the number of samples and $β\in [0,1)$ is a hyperparameter. We design a re-weighting scheme that uses the effective number of samples for each class to re-balance the loss, thereby yielding a class-balanced loss. Comprehensive experiments are conducted on artificially induced long-tailed CIFAR datasets and large-scale datasets including ImageNet and iNaturalist. Our results show that when trained with the proposed class-balanced loss, the network is able to achieve significant performance gains on long-tailed datasets.
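To make the diminishing-returns claim concrete, the formula from the abstract can be evaluated for growing n. This is a minimal sketch: the function name `effective_number` and the choice beta = 0.999 are illustrative assumptions, not taken from the paper's code.

```python
def effective_number(n: int, beta: float) -> float:
    """Effective number of samples E_n = (1 - beta**n) / (1 - beta), for beta in [0, 1)."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    return (1.0 - beta ** n) / (1.0 - beta)


beta = 0.999  # illustrative value; the abstract only states that beta is a hyperparameter
for n in (1, 10, 100, 1_000, 10_000):
    # Each tenfold increase in n adds less and less to E_n.
    print(f"n={n:>6}  E_n={effective_number(n, beta):8.2f}")
```

Since beta^n shrinks toward 0 as n grows, E_n can never exceed 1 / (1 - beta), which is about 1000 for this beta. Past that scale, extra samples of the same class contribute almost nothing new under this model.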
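Motivation and Objectives
- Motivate and model the diminishing returns of additional data caused by overlap among samples in real-world long-tailed distributions.
- Define the effective number of samples to quantify data overlap.
- Propose a loss re-weighting term inversely proportional to the effective number of samples of each class.
- Show that the class-balanced loss applies to softmax, sigmoid, and focal losses across datasets.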
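Proposed Method
- Define the effective number of samples as E_n = (1 - beta^n) / (1 - beta), where beta ∈ [0,1).
- Treat N, the number of unique prototypes a class can contain, as a dataset-level constant shared by all classes, and set beta = (N-1)/N to compute E_n for each class. In practice beta is tuned as a hyperparameter.
- Introduce class-balanced weights proportional to 1 / E_{n_i}, normalized so that they sum to C, the number of classes.
- Apply the weight to a base loss L to obtain the CB loss: CB(p, y) = (1 - beta) / (1 - beta^{n_y}) * L(p, y), where n_y is the number of training samples in the ground-truth class y.
- Derive CB versions of softmax cross-entropy, sigmoid cross-entropy, and focal loss (CB_softmax, CB_sigmoid, CB_focal).
- Note that CB_focal corresponds to setting alpha_t in the focal loss to (1 - beta)/(1 - beta^{n_y}).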
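The steps above can be sketched in plain Python for a single example. This is an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, every class is assumed to have at least one sample, the weights are normalized to sum to C as described in the list, and the focal variant is applied to per-class sigmoid outputs.

```python
import math
from typing import List, Sequence


def class_balanced_weights(samples_per_class: Sequence[int], beta: float) -> List[float]:
    """Weights proportional to 1 / E_{n_i}, rescaled to sum to the number of classes C."""
    # (1 - beta) / (1 - beta**n) is the inverse of the effective number E_n (needs n >= 1).
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in samples_per_class]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def cb_softmax_loss(logits: Sequence[float], label: int, weights: Sequence[float]) -> float:
    """Class-balanced softmax cross-entropy for one example."""
    m = max(logits)  # subtract the max logit for numerical stability
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return -weights[label] * (logits[label] - log_norm)


def cb_focal_loss(logits: Sequence[float], label: int,
                  weights: Sequence[float], gamma: float) -> float:
    """Class-balanced focal loss over per-class sigmoid outputs for one example."""
    total = 0.0
    for i, z in enumerate(logits):
        z_t = z if i == label else -z      # flip the sign for non-target classes
        log_p_t = log_sigmoid(z_t)
        p_t = math.exp(log_p_t)
        total += -((1.0 - p_t) ** gamma) * log_p_t
    return weights[label] * total


# Hypothetical long-tailed class counts, for illustration only.
counts = [5000, 500, 50]
weights = class_balanced_weights(counts, beta=0.999)
print([round(w, 3) for w in weights])
print(cb_softmax_loss([2.0, 0.5, -1.0], label=2, weights=weights))
print(cb_focal_loss([2.0, 0.5, -1.0], label=2, weights=weights, gamma=1.0))
```

With gamma = 0 the focal variant reduces to a class-balanced sigmoid cross-entropy, and with beta = 0 every weight equals 1, so the base loss is recovered unchanged.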
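Experimental Results
Research Questions
- RQ1: How can the effective number of samples be defined so that it captures data overlap in long-tailed distributions?
- RQ2: Does re-weighting by the inverse of the effective number of samples improve performance relative to re-weighting by inverse class frequency?
- RQ3: Is the proposed class-balanced loss agnostic to the base loss function, and does it apply to softmax, sigmoid, and focal losses?
- RQ4: What gains does the CB loss bring on artificially long-tailed CIFAR and on large-scale real-world datasets such as ImageNet and iNaturalist?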
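For RQ2, it may help to see how the effective-number weight relates to inverse class frequency. The short derivation below uses only the formula E_n = (1 - beta^n) / (1 - beta) with n ≥ 1; the symbol w_i for the unnormalized weight of class i is notation introduced here for illustration.

$$
E_n = \frac{1-\beta^{n}}{1-\beta} = \sum_{j=0}^{n-1} \beta^{j},
\qquad
w_i = \frac{1}{E_{n_i}} = \frac{1-\beta}{1-\beta^{n_i}}
$$

$$
\beta = 0:\quad E_n = 1,\; w_i = 1
\qquad\qquad
\beta \to 1:\quad E_n \to n,\; w_i \to \frac{1}{n_i}
$$

So beta = 0 gives no re-weighting at all, beta → 1 recovers inverse class frequency, and intermediate values interpolate between the two.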
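Key Findings
- CB loss yields significant performance gains on long-tailed datasets across different loss functions.
- CB_softmax, CB_sigmoid, and CB_focal outperform their respective non-balanced counterparts in the long-tailed CIFAR experiments.
- CB_focal with beta ≈ 0.999 and gamma roughly in the range 0.5–2.0 achieves strong results on iNaturalist and ImageNet.
- On large-scale data, CB_focal clearly outperforms softmax cross-entropy on iNaturalist and matches or exceeds the baseline on ImageNet.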
This digest was generated by AI and reviewed by human editors.