QUICK REVIEW

[Paper Review] A Closer Look at AUROC and AUPRC under Class Imbalance

Matthew B. A. McDermott, Haoran Zhang|arXiv (Cornell University)|2024. 01. 11.

Imbalanced Data Classification Techniques | Citations: 18

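One-Line Summary

The paper shows that AUPRC cannot be assumed to be superior to AUROC under class imbalance, derives a theoretical relationship between the two metrics, demonstrates a potential fairness bias of AUPRC through synthetic experiments, and uses a literature review to trace how the claim spread.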

ABSTRACT

In machine learning (ML), a widespread claim is that the area under the precision-recall curve (AUPRC) is a superior metric for model comparison to the area under the receiver operating characteristic (AUROC) for tasks with class imbalance. This paper refutes this notion on two fronts. First, we theoretically characterize the behavior of AUROC and AUPRC in the presence of model mistakes, establishing clearly that AUPRC is not generally superior in cases of class imbalance. We further show that AUPRC can be a harmful metric as it can unduly favor model improvements in subpopulations with more frequent positive labels, heightening algorithmic disparities. Next, we empirically support our theory using experiments on both semi-synthetic and real-world fairness datasets. Prompted by these insights, we conduct a review of over 1.5 million scientific papers to understand the origin of this invalid claim, finding that it is often made without citation, misattributed to papers that do not argue this point, and aggressively over-generalized from source arguments. Our findings represent a dual contribution: a significant technical advancement in understanding the relationship between AUROC and AUPRC and a stark warning about unchecked assumptions in the ML community.

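Motivation and Objectives

Challenge the widespread claim that AUPRC is superior to AUROC for imbalanced binary classification.
Formalize the mathematical relationship between AUROC and AUPRC.
Examine how the choice of metric affects fairness across subpopulations with different prevalence.
Assess the literature cited in support of AUPRC's advantage and identify misattributions.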

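Proposed Method

Establish a theoretical relationship between AUROC and AUPRC involving p+, p−, and p.
Define atomic mistakes and show that AUROC and AUPRC prioritize their correction differently (Theorems 1 and 2).
Run synthetic experiments to validate the theorems and to show the per-subpopulation effects of optimizing for AUROC versus AUPRC.
Conduct a literature review, combining automated and manual analysis, to assess how widespread and how well supported the claim of AUPRC's superiority under imbalance is.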
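The relationship between the two metrics can be sketched with a short derivation from Bayes' rule. The notation here is an assumption made for illustration and may differ from the paper's: f(x) is the model score, t is a threshold drawn from the scores of positive samples, FPR(t) = P(f(x) > t | y = 0), and P(f(x) > t) is the rate at which the model outputs a score above t. How these symbols map onto the p+, p−, and p mentioned above is not specified in this review.

```latex
\begin{align}
\mathrm{AUROC}(f)
  &= \mathbb{E}_{t \sim f(x) \mid y=1}\bigl[\,1 - \mathrm{FPR}(t)\,\bigr]
   = 1 - \mathbb{E}_{t \sim f(x) \mid y=1}\bigl[\mathrm{FPR}(t)\bigr], \\[4pt]
\mathrm{Prec}(t)
  &= P\bigl(y=1 \mid f(x) > t\bigr)
   = 1 - \frac{P\bigl(f(x) > t \mid y=0\bigr)\,P(y=0)}{P\bigl(f(x) > t\bigr)}
   = 1 - P(y=0)\,\frac{\mathrm{FPR}(t)}{P\bigl(f(x) > t\bigr)}, \\[4pt]
\mathrm{AUPRC}(f)
  &= \mathbb{E}_{t \sim f(x) \mid y=1}\bigl[\mathrm{Prec}(t)\bigr]
   = 1 - P(y=0)\,\mathbb{E}_{t \sim f(x) \mid y=1}\!\left[\frac{\mathrm{FPR}(t)}{P\bigl(f(x) > t\bigr)}\right].
\end{align}
```

Read this way, both metrics average the false positive rate over thresholds set by the positive samples. AUROC averages it unweighted, while AUPRC divides it by P(f(x) > t), which is small at high thresholds, so false positives in the high-score region count for more.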

Figure 1 : Atomic mistakes occur when neighboring samples, when ordered by model score, are out-of-order with respect to the classification label. AUROC improves by a constant amount no matter which atomic mistake is corrected; AUPRC improves in descending order with model score due to the dependenc
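The behavior described in the caption can be checked on a toy ranking. The following is a minimal sketch, not the paper's code: the helper names (`auroc`, `average_precision`, `fix_atomic_mistake`) and the eight-sample ranking are hypothetical, scores are assumed to have no ties, and AUPRC is approximated by average precision.

```python
def auroc(ranked_labels):
    """AUROC for labels sorted by descending score (no tied scores assumed)."""
    n_pos = sum(ranked_labels)
    n_neg = len(ranked_labels) - n_pos
    pos_seen = 0
    ordered_pairs = 0  # (positive, negative) pairs with the positive ranked higher
    for y in ranked_labels:
        if y == 1:
            pos_seen += 1
        else:
            ordered_pairs += pos_seen
    return ordered_pairs / (n_pos * n_neg)


def average_precision(ranked_labels):
    """Average precision: mean of precision at the rank of each positive."""
    n_pos = sum(ranked_labels)
    pos_seen = 0
    total = 0.0
    for rank, y in enumerate(ranked_labels, start=1):
        if y == 1:
            pos_seen += 1
            total += pos_seen / rank
    return total / n_pos


def fix_atomic_mistake(ranked_labels, i):
    """Swap a negative at index i with the positive ranked directly below it."""
    if ranked_labels[i] != 0 or ranked_labels[i + 1] != 1:
        raise ValueError("no atomic mistake at this index")
    fixed = list(ranked_labels)
    fixed[i], fixed[i + 1] = fixed[i + 1], fixed[i]
    return fixed


# Hypothetical ranking, highest score first: one atomic mistake near the top
# (indices 0 and 1) and one near the bottom (indices 6 and 7).
ranking = [0, 1, 0, 0, 0, 0, 0, 1]
top_fixed = fix_atomic_mistake(ranking, 0)
bottom_fixed = fix_atomic_mistake(ranking, 6)

for name, fixed in [("top", top_fixed), ("bottom", bottom_fixed)]:
    d_auroc = auroc(fixed) - auroc(ranking)
    d_ap = average_precision(fixed) - average_precision(ranking)
    print(f"fix {name}: AUROC gain = {d_auroc:.4f}, AP gain = {d_ap:.4f}")
```

In this toy case both fixes raise AUROC by the same amount, 1 / (number of positives × number of negatives), while the fix near the top of the ranking raises average precision far more than the fix near the bottom.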

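Experimental Results

Research Questions

RQ1: Are AUROC and AUPRC deterministically related in binary classification with unequal class prevalence?
RQ2: How does each metric prioritize model improvements across score regions and subpopulations?
RQ3: Compared with AUROC, does optimizing for AUPRC create disparities between subgroups with different prevalence?
RQ4: Is the common belief that AUPRC is superior under imbalance backed by empirical evidence across the literature?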

Key Findings

AUROC and AUPRC are related probabilistically through a formal expression, which challenges the notion that AUPRC is universally superior.
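AUROC weights all false positives equally and treats every score region alike, whereas AUPRC weights false positives by the inverse of the rate at which the model outputs scores above the given threshold, so it prioritizes mistakes at high scores.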
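Optimizing for AUPRC tends to favor high-prevalence subpopulations, which can harm fairness across groups with different prevalence.
Synthetic experiments show that tuning for AUPRC can widen the gap between subpopulations, whereas optimizing for AUROC improves the metric more evenly across them.
The literature review finds that the claim of AUPRC's superiority under imbalance is widespread but often misattributed, and many citations offer no solid support for it.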
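The different weighting of false positives can be verified numerically on a small sample. The sketch below is an illustration under stated assumptions, not the paper's experiment: the scores and labels are made up, the function names are hypothetical, scores are assumed distinct, and thresholds are taken at the score of each positive with a sample predicted positive when its score is at or above the threshold.

```python
def counts_at(scores, labels, t):
    """True and false positive counts when predicting positive for score >= t."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
    return tp, fp


def average_precision(scores, labels):
    """Mean precision over thresholds set at each positive sample's score."""
    pos_scores = [s for s, y in zip(scores, labels) if y == 1]
    total = 0.0
    for t in pos_scores:
        tp, fp = counts_at(scores, labels, t)
        total += tp / (tp + fp)
    return total / len(pos_scores)


def auprc_from_fpr(scores, labels):
    """1 - P(y=0) * mean over positives of FPR(t) / firing_rate(t)."""
    n = len(labels)
    n_neg = labels.count(0)
    pos_scores = [s for s, y in zip(scores, labels) if y == 1]
    total = 0.0
    for t in pos_scores:
        tp, fp = counts_at(scores, labels, t)
        fpr = fp / n_neg               # share of negatives scored at or above t
        firing_rate = (tp + fp) / n    # share of all samples scored at or above t
        total += fpr / firing_rate
    return 1.0 - (n_neg / n) * total / len(pos_scores)


def auroc_from_fpr(scores, labels):
    """1 - mean over positives of FPR(t): every false positive weighted equally."""
    n_neg = labels.count(0)
    pos_scores = [s for s, y in zip(scores, labels) if y == 1]
    mean_fpr = sum(counts_at(scores, labels, t)[1] / n_neg for t in pos_scores)
    return 1.0 - mean_fpr / len(pos_scores)


# Made-up imbalanced sample: 3 positives, 7 negatives, distinct scores.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

print("average precision:     ", average_precision(scores, labels))
print("via weighted FPR:      ", auprc_from_fpr(scores, labels))
print("AUROC via unweighted FPR:", auroc_from_fpr(scores, labels))
```

The first two printed values agree because precision at a threshold equals one minus P(y=0) times the false positive rate divided by the firing rate. The division by the firing rate is what makes a false positive at a high threshold, where few samples fire, cost more under AUPRC than under AUROC.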

(a) Fixing atomic mistakes to optimize overall AUROC

This review was generated by AI and checked by a human editor.