QUICK REVIEW

[Paper Review] Restoring balance: principled under/oversampling of data for optimal classification

Emanuele Loffredo, Mauro Pastore|arXiv (Cornell University)|May 15, 2024

Statistical Methods and Inference | Cited by 5

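One-sentence summary

The paper derives exact analytical generalization curves for high-dimensional linear classifiers under class imbalance, identifies the optimal mixed under/oversampling strategy, and validates the predictions on real data and deep models.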

ABSTRACT

Class imbalance in real-world data poses a common bottleneck for machine learning tasks, since achieving good generalization on under-represented examples is often challenging. Mitigation strategies, such as under or oversampling the data depending on their abundances, are routinely proposed and tested empirically, but how they should adapt to the data statistics remains poorly understood. In this work, we determine exact analytical expressions of the generalization curves in the high-dimensional regime for linear classifiers (Support Vector Machines). We also provide a sharp prediction of the effects of under/oversampling strategies depending on class imbalance, first and second moments of the data, and the metrics of performance considered. We show that mixed strategies involving under and oversampling of data lead to performance improvement. Through numerical experiments, we show the relevance of our theoretical predictions on real datasets, on deeper architectures and with sampling strategies based on unsupervised probabilistic models.

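Motivation and objectives

Motivate and formalize the class-imbalance problem in high-dimensional supervised learning.
Use statistical-mechanics methods to derive exact analytical expressions for the generalization performance of linear classifiers under imbalance.
Identify the under/oversampling strategies, including mixed ones, that maximize a given performance metric.
Validate the theoretical predictions with experiments on real datasets, deeper architectures, and more advanced sampling methods.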

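Proposed method

Model training as empirical risk minimization with the hinge loss under a spherical regularization (soft-margin SVM).
Characterize the data by their first- and second-order statistics (mean M, shift δ, covariance C) and take the high-dimensional limit (L→∞).
Apply the replica method to derive saddle-point equations, which express the performance metrics as functions of the data statistics and the imbalance ratio.
Derive exact asymptotic predictions for the test metrics (confusion matrix, accuracy ACC, balanced accuracy BA, AUC) from the distributions of the test pre-activations Δ±, evaluated at the parameters that solve the saddle-point equations.
Analyze how imbalance affects each metric and compute the optimal under/oversampling mixing parameter (the mixing percentage).
Extend the theory through numerical experiments to deeper models, and explore sampling based on an unsupervised RBM (LIS) alongside simple over/undersampling strategies.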
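To make the idea of a mixed under/oversampling strategy concrete, here is a minimal sketch in plain Python. The function name `mixed_resample` and the single `mix` parameter are assumptions made for illustration; the summary does not state how the paper parametrizes the mixture. In this sketch, `mix = 0` corresponds to pure undersampling (both classes cut to the minority size), `mix = 1` to pure oversampling (both classes brought to the majority size), and intermediate values interpolate linearly between the two. Oversampling is done here by random duplication, which is only one of the simple strategies one might use.

```python
import random


def mixed_resample(minority, majority, mix, seed=0):
    """Resample two classes to a common size between the two pure strategies.

    mix = 0.0 -> pure undersampling (both classes at len(minority))
    mix = 1.0 -> pure oversampling  (both classes at len(majority))

    NOTE: this parametrization is a hypothetical illustration, not the
    paper's definition of the mixing percentage.
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    n_min, n_maj = len(minority), len(majority)
    if n_min == 0 or n_min > n_maj:
        raise ValueError("need 0 < len(minority) <= len(majority)")

    rng = random.Random(seed)
    # Common per-class size, interpolated between the two class sizes.
    target = n_min + round(mix * (n_maj - n_min))

    # Undersample the majority class without replacement.
    kept_majority = rng.sample(list(majority), target)

    # Oversample the minority class: keep every original example,
    # then add randomly chosen duplicates until the target is reached.
    duplicates = [rng.choice(minority) for _ in range(target - n_min)]
    return list(minority) + duplicates, kept_majority


if __name__ == "__main__":
    minority_ids = list(range(100))          # 100 minority examples
    majority_ids = list(range(100, 1100))    # 1000 majority examples
    for mix in (0.0, 0.5, 1.0):
        new_min, new_maj = mixed_resample(minority_ids, majority_ids, mix)
        print(mix, len(new_min), len(new_maj))
```

The resampled training set is always balanced in this sketch; what `mix` controls is how much majority data is discarded versus how much minority data is duplicated. Scanning `mix` and measuring a test metric for each value is one simple way to look for an optimum empirically.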

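Experimental results

Research questions

RQ1: How does class imbalance affect the generalization performance of linear classifiers in the high-dimensional limit?
RQ2: Which sampling strategy (undersampling, oversampling, or a mix) best mitigates imbalance under different performance metrics?
RQ3: Under realistic data statistics, does mixed under/oversampling outperform pure undersampling or pure oversampling?
RQ4: Do the theoretical predictions hold for deeper architectures and more sophisticated sampling methods?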

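Key findings

Under class imbalance, AUC is relatively insensitive, whereas BA is more informative and favors balanced training.
Under imbalance, the best generalization performance often requires a mix of undersampling and oversampling rather than pure undersampling or pure oversampling.
Full undersampling performs poorly; mixed strategies improve balanced accuracy (BA) across several scenarios.
RBM-based likelihood-informed sampling (LIS) improves on random sampling for linear SVMs on MNIST-like tasks.
Balanced training improves the performance of deep classifiers (for example, a ResNet-50 fine-tuned on binarized CIFAR-10) and yields sharper decision boundaries.
Under reasonable covariance assumptions, the theory quantitatively predicts BA curves on benchmark datasets (MNIST variants, CelebA).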
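The difference between ACC, BA, and AUC is easier to see on a toy example. The sketch below uses the standard textbook definitions of the three metrics with labels in {+1, -1}; the helper names and the toy data are assumptions for illustration and are not taken from the paper. It shows two generic facts: a classifier that always predicts the majority class can reach high accuracy while its balanced accuracy stays at 0.5, and AUC depends only on the ranking of the scores, so adding a constant offset to every score leaves it unchanged.

```python
def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for labels in {+1, -1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(y_true, y_pred):
    """Mean of the true-positive rate and the true-negative rate."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as one half. Quadratic in the number of examples, which
    is fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy imbalanced test set: 5 positives, 95 negatives.
    y_true = [1] * 5 + [-1] * 95
    always_negative = [-1] * 100
    print("ACC:", accuracy(y_true, always_negative))
    print("BA: ", balanced_accuracy(y_true, always_negative))

    # AUC only sees the ordering of the scores.
    labels = [1, 1, -1, -1]
    scores = [0.9, 0.8, 0.1, 0.85]
    shifted = [s - 10.0 for s in scores]
    print("AUC:", auc(labels, scores), auc(labels, shifted))
```

Because AUC ignores any global offset of the scores, it cannot register a decision threshold that has drifted toward the majority class, whereas BA is computed from thresholded predictions and does.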


This review was generated by AI and checked by a human editor.