QUICK REVIEW

[Paper Review] Identifying and Compensating for Feature Deviation in Imbalanced Deep Learning

Han-Jia Ye, Hong-You Chen|arXiv (Cornell University)|Jan 6, 2020

Imbalanced Data Classification Techniques | References: 62 | Cited by: 55

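One-Sentence Summary

The paper identifies feature deviation, the phenomenon behind over-fitting to minor classes in imbalanced deep learning, and proposes class-dependent temperatures (CDT) to compensate for it during training, which improves minor-class performance on benchmark datasets.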

ABSTRACT

Classifiers trained with class-imbalanced data are known to perform poorly on test data of the "minor" classes, of which we have insufficient training data. In this paper, we investigate learning a ConvNet classifier under such a scenario. We found that a ConvNet significantly over-fits the minor classes, which is quite opposite to traditional machine learning algorithms that often under-fit minor classes. We conducted a series of analysis and discovered the feature deviation phenomenon -- the learned ConvNet generates deviated features between the training and test data of minor classes -- which explains how over-fitting happens. To compensate for the effect of feature deviation which pushes test data toward low decision value regions, we propose to incorporate class-dependent temperatures (CDT) in training a ConvNet. CDT simulates feature deviation in the training phase, forcing the ConvNet to enlarge the decision values for minor-class data so that it can overcome real feature deviation in the test phase. We validate our approach on benchmark datasets and achieve promising performance. We hope that our insights can inspire new ways of thinking in resolving class-imbalanced deep learning.

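Motivation and Objectives

Understand why ConvNets over-fit minor classes under long-tailed distributions.
Characterize the feature deviation between the training and test data of minor classes.
Assess the limitations of re-weighting and re-sampling for imbalanced learning with ConvNets.
Propose and validate a training strategy (CDT) that compensates for feature deviation without reducing the deviation itself.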

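Proposed Method

Empirically analyze ConvNet behavior on imbalanced data and observe the deviation between the training and test features of minor classes.
Decompose the classifier as ŷ = arg max_c w_c^T f_theta(x) and study how minor-class features drift in feature space.
Quantify feature deviation as the distance between per-class training and test feature means (Eq. 4).
Introduce class-dependent temperatures a_c that modify the training objective, effectively enlarging the decision values of minor classes.
Define a_c = (N_max / N_c)^gamma with gamma >= 0, where N_c is the number of training examples in class c and N_max is the size of the largest class; gamma controls the degree of compensation, and training uses the modified cross-entropy (Eq. 5).
Evaluate CDT on CIFAR-10/100, Tiny-ImageNet, and iNaturalist under varying imbalance ratios, comparing against ERM, re-sampling, and re-weighting baselines.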
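
The temperature definition and the modified objective can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: it assumes that the modified cross-entropy (Eq. 5) divides each class logit w_c^T f_theta(x) by a_c before the softmax, and that test-time prediction uses the unscaled logits as in the arg max above. The function names, class counts, and logits are hypothetical.

```python
import math

def cdt_temperatures(class_counts, gamma):
    """Compute a_c = (N_max / N_c) ** gamma for every class (gamma >= 0)."""
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    n_max = max(class_counts)
    return [(n_max / n_c) ** gamma for n_c in class_counts]

def cdt_cross_entropy(logits, label, temperatures):
    """Cross-entropy on temperature-scaled logits (assumed form of Eq. 5)."""
    scaled = [z / a for z, a in zip(logits, temperatures)]
    # Numerically stable log-sum-exp over the scaled logits.
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_sum_exp - scaled[label]

def predict(logits):
    """Test-time prediction on raw logits (assumption: no temperature here)."""
    return max(range(len(logits)), key=lambda c: logits[c])

# Hypothetical three-class problem: one major class and two smaller ones.
counts = [5000, 500, 50]
temps = cdt_temperatures(counts, gamma=0.5)   # larger a_c for rarer classes
no_temps = cdt_temperatures(counts, gamma=0)  # gamma = 0 gives a_c = 1 for all c

# Made-up logits for one training example whose true label is the rarest class.
logits = [2.0, 1.0, 3.0]
plain_loss = cdt_cross_entropy(logits, 2, no_temps)
cdt_loss = cdt_cross_entropy(logits, 2, temps)
```

Under these assumptions, the rarest class has its logit shrunk the most during training, so an example that is already classified correctly still incurs a larger loss than under plain cross-entropy. The network therefore has to push the raw decision value of that class higher, which is the compensation effect the method relies on.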

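Experimental Results

Research Questions

RQ1: What causes poor minor-class performance in imbalanced deep learning with ConvNets?
RQ2: Does feature deviation between training and test data explain the over-fitting to minor classes?
RQ3: Can adjusting the training objective to simulate feature deviation (CDT) improve test performance without reducing the deviation itself?
RQ4: How does CDT compare with re-sampling and re-weighting on standard imbalanced benchmarks?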

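Key Findings

ConvNets trained end to end on imbalanced data over-fit minor classes, in contrast to traditional machine learning methods, which often under-fit them.
Minor-class features diverge between the training and test sets, and the deviation grows as class frequency decreases.
Naive re-sampling and re-weighting do not reduce feature deviation and may fail to improve, or even worsen, minor-class performance.
CDT compensates for feature deviation by enlarging the decision values of minor classes during training, which raises test accuracy.
CDT achieves superior or competitive performance on several benchmark datasets (CIFAR-10/100, Tiny-ImageNet, iNaturalist).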
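
The deviation these findings refer to can be measured roughly as follows. This is a simplified sketch that assumes the metric is the Euclidean distance between a class's mean training feature and its mean test feature; the paper's exact formulation (Eq. 4) may differ in details. The helper names and the two-dimensional feature vectors are made up for illustration and are not results from the paper.

```python
import math

def class_mean(features):
    """Element-wise mean of a list of equal-length feature vectors."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]

def feature_deviation(train_feats, test_feats):
    """Euclidean distance between the train and test feature means of one class."""
    mu_train = class_mean(train_feats)
    mu_test = class_mean(test_feats)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_train, mu_test)))

# Toy features standing in for f_theta(x); illustrative values only.
train_major = [[1.0, 1.0], [3.0, 3.0]]
test_major = [[2.0, 2.0]]
train_minor = [[0.0, 0.0], [2.0, 0.0]]
test_minor = [[4.0, 4.0], [4.0, 4.0]]

major_dev = feature_deviation(train_major, test_major)
minor_dev = feature_deviation(train_minor, test_minor)
```

In practice the features would come from the penultimate layer of the trained network, and the per-class deviation could be plotted against class size to check whether it grows as classes get rarer.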


This review was generated by AI and checked by a human editor.