QUICK REVIEW

[Paper Review] A Closer Look at AUROC and AUPRC under Class Imbalance

Matthew B. A. McDermott, Haoran Zhang|arXiv (Cornell University)|Jan 11, 2024

Imbalanced Data Classification Techniques|Cited by 18

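ONE-SENTENCE SUMMARY

The paper argues that AUPRC is not universally superior to AUROC under class imbalance, derives a theoretical relationship between the two metrics, and uses synthetic experiments and a literature review to show that AUPRC can introduce fairness bias.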

ABSTRACT

In machine learning (ML), a widespread claim is that the area under the precision-recall curve (AUPRC) is a superior metric for model comparison to the area under the receiver operating characteristic (AUROC) for tasks with class imbalance. This paper refutes this notion on two fronts. First, we theoretically characterize the behavior of AUROC and AUPRC in the presence of model mistakes, establishing clearly that AUPRC is not generally superior in cases of class imbalance. We further show that AUPRC can be a harmful metric as it can unduly favor model improvements in subpopulations with more frequent positive labels, heightening algorithmic disparities. Next, we empirically support our theory using experiments on both semi-synthetic and real-world fairness datasets. Prompted by these insights, we conduct a review of over 1.5 million scientific papers to understand the origin of this invalid claim, finding that it is often made without citation, misattributed to papers that do not argue this point, and aggressively over-generalized from source arguments. Our findings represent a dual contribution: a significant technical advancement in understanding the relationship between AUROC and AUPRC and a stark warning about unchecked assumptions in the ML community.

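MOTIVATION AND OBJECTIVES

Challenge the widely held view that AUPRC is superior to AUROC for imbalanced binary classification.
Formalize the mathematical relationship between AUROC and AUPRC.
Examine how the choice of metric affects fairness across subgroups with different prevalence.
Assess the literature cited in support of AUPRC's supposed advantage and identify misattributions.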

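PROPOSED METHOD

Prove a theoretical relationship between AUROC and AUPRC in terms of the distributions p+, p−, and p.
Define atomic mistakes and show that AUROC and AUPRC prioritize their correction differently (Theorem 1 and Theorem 2).
Run synthetic experiments to validate the theorems and to show the effect on each subgroup when optimizing for AUROC versus AUPRC.
Conduct a literature review that combines automated and manual analysis to assess how prevalent the claim of AUPRC's superiority under imbalance is and what evidence supports it.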
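
The atomic-mistake idea in the list above can be illustrated with a small sketch. It is not the paper's code: the label sequence `LABELS` and all function names are hypothetical, scores are assumed to have no ties, and average precision is used as a step-wise stand-in for AUPRC. Samples are represented only by their labels, sorted by descending model score, and an atomic mistake is a negative ranked immediately above a positive.

```python
def auroc(labels):
    """AUROC for 0/1 labels sorted by descending score (no ties assumed)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    positives_above = 0
    correct_pairs = 0
    for y in labels:
        if y == 1:
            positives_above += 1
        else:
            # Every positive ranked above this negative is a correctly ordered pair.
            correct_pairs += positives_above
    return correct_pairs / (n_pos * n_neg)


def average_precision(labels):
    """Average precision, used here as a step-wise estimate of AUPRC."""
    n_pos = sum(labels)
    hits = 0
    total = 0.0
    for rank, y in enumerate(labels, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


def atomic_mistakes(labels):
    """Indices i where a negative sits directly above a positive."""
    return [i for i in range(len(labels) - 1)
            if labels[i] == 0 and labels[i + 1] == 1]


def fix(labels, i):
    """Return a copy with the adjacent pair at i, i+1 swapped."""
    fixed = list(labels)
    fixed[i], fixed[i + 1] = fixed[i + 1], fixed[i]
    return fixed


def gains(labels):
    """(index, AUROC gain, AP gain) for fixing each atomic mistake alone."""
    base_roc, base_ap = auroc(labels), average_precision(labels)
    out = []
    for i in atomic_mistakes(labels):
        fixed = fix(labels, i)
        out.append((i,
                    auroc(fixed) - base_roc,
                    average_precision(fixed) - base_ap))
    return out


# Hypothetical ranking: 5 positives, 5 negatives, highest score first.
LABELS = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]

if __name__ == "__main__":
    for i, d_roc, d_ap in gains(LABELS):
        print(f"fix at rank {i + 1}: AUROC +{d_roc:.4f}, AP +{d_ap:.4f}")
```

On this toy ranking, every fix raises AUROC by the same 1/25, because exactly one positive-negative pair changes order, while the average-precision gain shrinks as the fixed mistake sits lower in the ranking (about 0.067, 0.027, and 0.018). A single example like this only illustrates the pattern; it does not prove the general statement.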

Figure 1 : Atomic mistakes occur when neighboring samples, when ordered by model score, are out-of-order with respect to the classification label. AUROC improves by a constant amount no matter which atomic mistake is corrected; AUPRC improves in descending order with model score due to the dependenc

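EXPERIMENTAL RESULTS

Research questions

RQ1: In binary classification with unequal class prevalence, are AUROC and AUPRC deterministically related?
RQ2: How does each metric prioritize model improvements across score regions and subgroups?
RQ3: Compared with optimizing for AUROC, does optimizing for AUPRC lead to disparities between subgroups with different prevalence?
RQ4: Is the widespread view in the literature that AUPRC is superior to AUROC under imbalance supported by empirical evidence?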

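Key findings

AUROC and AUPRC are related probabilistically through a formal expression, which challenges the claim that AUPRC is universally superior.
AUROC weights all false positives equally and is unbiased across score regions, whereas AUPRC weights false positives by the inverse of the model's firing rate and so prioritizes mistakes among high-scoring samples.
Optimizing for AUPRC tends to favor high-prevalence subgroups, which can harm fairness across groups with different prevalence.
Synthetic experiments show that tuning for AUPRC can widen disparities between subgroups, whereas optimizing for AUROC improves the metrics more evenly across them.
An in-depth literature review shows that the claim that AUPRC is superior under imbalance is widespread but often misattributed, and that many of the citations offered for it lack reliable support.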
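
The formal relationship behind the first two findings can be written out from the standard definitions of the two metrics. This is a sketch rather than the paper's exact statement. It assumes a score function f with continuous scores and no ties, writes FPR(f, t) for the false positive rate at threshold t, and calls P(f(x) > t) the firing rate. The notation is an assumption and may differ from the paper's.

```latex
% Assumed notation: f is the score function; the threshold t is drawn from
% the scores of positive samples, written t ~ f(x) | y = 1.
\begin{align}
\mathrm{AUROC}(f)
  &= \mathbb{E}_{t \sim f(x) \mid y=1}\bigl[P(f(x) < t \mid y = 0)\bigr]
   = 1 - \mathbb{E}_{t \sim f(x) \mid y=1}\bigl[\mathrm{FPR}(f, t)\bigr] \\
\mathrm{AUPRC}(f)
  &= \mathbb{E}_{t \sim f(x) \mid y=1}\bigl[P(y = 1 \mid f(x) > t)\bigr]
   = 1 - \mathbb{E}_{t \sim f(x) \mid y=1}\bigl[P(y = 0 \mid f(x) > t)\bigr] \\
  &= 1 - P(y = 0)\,
     \mathbb{E}_{t \sim f(x) \mid y=1}\!\left[
       \frac{\mathrm{FPR}(f, t)}{P(f(x) > t)}
     \right]
\end{align}
% The last step applies Bayes' rule:
% P(y = 0 | f(x) > t) = P(f(x) > t | y = 0) P(y = 0) / P(f(x) > t).
```

Read this way, both metrics average the false positive rate over thresholds drawn from the scores of positive samples. AUPRC additionally divides by the firing rate, which is small at high thresholds, so false positives among high-scoring samples are penalized more heavily.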

Figure panel (a): Fixing atomic mistakes to optimize overall AUROC


This review was generated by AI and reviewed by human editors.