QUICK REVIEW

[Paper Summary] The Fairness of Risk Scores Beyond Classification: Bipartite Ranking and the xAUC Metric

Nathan Kallus, Angela Zhou|arXiv (Cornell University)|Jan 1, 2019

Law, Economics, and Judicial Systems|Cited by 14

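One-Sentence Summary

This paper proposes the xAUC disparity, a metric for assessing the fairness of predictive risk scores beyond binary classification, by framing risk scoring as a bipartite ranking task. The approach decomposes ranking loss into group-specific predictive performance and disparity components, revealing fairness problems in recidivism, income, and cardiac arrest prediction that conventional metrics miss.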

ABSTRACT

Where machine-learned predictive risk scores inform high-stakes decisions, such as bail and sentencing in criminal justice, fairness has been a serious concern. Recent work has characterized the disparate impact that such risk scores can have when used for a binary classification task. This may not account, however, for the more diverse downstream uses of risk scores and their non-binary nature. To better account for this, in this paper, we investigate the fairness of predictive risk scores from the point of view of a bipartite ranking task, where one seeks to rank positive examples higher than negative ones. We introduce the xAUC disparity as a metric to assess the disparate impact of risk scores and define it as the difference in the probabilities of ranking a random positive example from one protected group above a negative one from another group and vice versa. We provide a decomposition of bipartite ranking loss into components that involve the discrepancy and components that involve pure predictive ability within each group. We use xAUC analysis to audit predictive risk scores for recidivism prediction, income prediction, and cardiac arrest prediction, where it describes disparities that are not evident from simply comparing within-group predictive performance.

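Research Motivation and Objectives

Address the limitations of fairness assessment when risk scores are used for tasks other than binary classification.
Model the fairness of risk scores within a bipartite ranking framework, in which positive examples should be ranked above negative ones.
Develop a new fairness metric, the xAUC disparity, to quantify differences between protected groups in the probability of a correct ranking.
Decompose the bipartite ranking loss into two kinds of components: within-group predictive performance and between-group disparity.
Audit real-world risk scores for recidivism, income, and cardiac arrest prediction using xAUC analysis, to surface fairness problems that conventional within-group metrics cannot detect.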

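Proposed Method

Defines the xAUC disparity as the difference between the probability of correctly ranking a positive example from one protected group above a negative example from another group and the same probability with the two groups' roles swapped.
Uses a probabilistic formulation of the xAUC disparity to compare how likely a correct relative ranking is across groups.
Decomposes the bipartite ranking loss into two parts: one capturing predictive ability within each group, the other measuring differences in ranking performance between groups.
Applies the xAUC framework to audit risk scores in three domains: recidivism prediction, income prediction, and cardiac arrest prediction.
Compares the xAUC disparity empirically with standard within-group performance metrics, highlighting disparities that conventional methods miss.
Uses a statistical decomposition to separate the effect of group imbalance on ranking performance from the effect of differences in model calibration or discrimination.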
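In symbols, the definitions above can be sketched as follows. The notation is an assumption made here for illustration and is not quoted from the paper: R is the risk score, Y the binary outcome, A the protected attribute, and the subscripts 1 and 0 index an independently drawn positive and negative example. Ties in R are ignored.

```latex
\begin{align*}
\mathrm{xAUC}(a,b) &= \Pr\left(R_1 > R_0 \mid Y_1 = 1,\ A_1 = a,\ Y_0 = 0,\ A_0 = b\right), \\
\Delta\mathrm{xAUC} &= \mathrm{xAUC}(a,b) - \mathrm{xAUC}(b,a), \\
\mathrm{AUC} &= \sum_{a'} \sum_{b'} \Pr(A = a' \mid Y = 1)\,\Pr(A = b' \mid Y = 0)\,\mathrm{xAUC}(a',b').
\end{align*}
```

The last line is the law of total probability applied to the group memberships of the positive and the negative example. The terms with a' = b' are the ordinary within-group AUCs, and the terms with a' ≠ b' are the cross-group quantities whose difference defines the disparity. Under the same no-ties assumption, the bipartite ranking loss is 1 − AUC, so it splits along the same lines.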

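Experimental Results

Research Questions

RQ1: How does the fairness of risk scores manifest when assessed through a bipartite ranking lens rather than a binary classification lens?
RQ2: To what extent do existing fairness metrics fail to capture disparities that arise in ranking-based decisions?
RQ3: Can the xAUC disparity effectively detect fairness problems that are not evident in within-group predictive performance?
RQ4: In real-world risk prediction tasks, how do the two components of ranking loss (predictive ability and group disparity) play different roles?
RQ5: What fairness information does xAUC analysis reveal in recidivism, income, and cardiac arrest prediction that conventional metrics miss?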

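Key Findings

In recidivism prediction, the xAUC disparity identifies fairness disparities that within-group predictive performance metrics fail to detect.
In income prediction, xAUC analysis reveals substantial ranking imbalances between demographic groups even when within-group AUC values are similar.
In cardiac arrest prediction, the metric reveals a systematic ranking disadvantage for some protected groups despite acceptable overall model performance.
The decomposition of ranking loss shows that the between-group disparity component contributes substantially to total loss, indicating that fairness problems do not stem solely from poor predictive ability.
Compared with conventional classification-based fairness metrics, the xAUC disparity provides a more fine-grained and comprehensive fairness assessment.
The empirical results show that risk scores with similar within-group AUC can differ markedly in cross-group ranking fairness, underscoring the need for alternative evaluation frameworks.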
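The last finding can be illustrated on a small synthetic example. The sketch below is not the authors' code: the function names `xauc` and `xauc_disparity`, the half-credit treatment of ties, and the eight-row toy dataset are all assumptions made here for illustration. It estimates the cross-group ranking probability by comparing every positive score in one group with every negative score in the other.

```python
import numpy as np


def xauc(scores, labels, groups, pos_group, neg_group):
    """Estimate P(positive from pos_group is scored above negative from neg_group).

    Ties count as one half (an assumption of this sketch). With
    pos_group == neg_group this reduces to the within-group AUC.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    groups = np.asarray(groups)
    pos = scores[(labels == 1) & (groups == pos_group)]
    neg = scores[(labels == 0) & (groups == neg_group)]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative example")
    # Pairwise score gaps between every positive and every negative.
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def xauc_disparity(scores, labels, groups, a, b):
    """xAUC(a, b) minus xAUC(b, a), following the definition in the abstract."""
    return xauc(scores, labels, groups, a, b) - xauc(scores, labels, groups, b, a)


# Synthetic toy data (hypothetical, not from the paper): each group is
# perfectly ranked internally, but group "b" is scored lower across the board.
scores = np.array([0.9, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1, 1, 0, 0, 1, 1, 0, 0])
groups = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

auc_a = xauc(scores, labels, groups, "a", "a")  # within-group AUC for "a"
auc_b = xauc(scores, labels, groups, "b", "b")  # within-group AUC for "b"
disparity = xauc_disparity(scores, labels, groups, "a", "b")

print(f"AUC(a)={auc_a}, AUC(b)={auc_b}, xAUC disparity={disparity}")
```

On this toy data both within-group AUCs equal 1.0, yet the disparity is also 1.0: every positive in group "a" outranks every negative in group "b", while no positive in group "b" outranks any negative in group "a". The all-pairs comparison uses memory proportional to the number of positive-negative pairs, so a rank-based estimator would be preferable on large datasets.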

This summary was generated by AI and reviewed by a human editor.