QUICK REVIEW

[Paper Review] Comparative Separation: Evaluating Separation on Comparative Judgment Test Data

Xiaoyin Xi, Neeku Capak|arXiv (Cornell University)|Jan 11, 2026

Ethics and Social Impacts of AI | Cited by 0

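One-Sentence Summary

The paper defines comparative separation, proves that it is equivalent to separation in binary classification, and develops statistical tests and a power analysis for evaluating fairness with comparative judgments. The theory is validated through simulations and real-world datasets.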

ABSTRACT

This research seeks to benefit the software engineering society by proposing comparative separation, a novel group fairness notion to evaluate the fairness of machine learning software on comparative judgment test data. Fairness issues have attracted increasing attention since machine learning software is increasingly used for high-stakes and high-risk decisions. It is the responsibility of all software developers to make their software accountable by ensuring that the machine learning software do not perform differently on different sensitive groups -- satisfying the separation criterion. However, evaluation of separation requires ground truth labels for each test data point. This motivates our work on analyzing whether separation can be evaluated on comparative judgment test data. Instead of asking humans to provide the ratings or categorical labels on each test data point, comparative judgments are made between pairs of data points such as A is better than B. According to the law of comparative judgment, providing such comparative judgments yields a lower cognitive burden for humans than providing ratings or categorical labels. This work first defines the novel fairness notion comparative separation on comparative judgment test data, and the metrics to evaluate comparative separation. Then, both theoretically and empirically, we show that in binary classification problems, comparative separation is equivalent to separation. Lastly, we analyze the number of test data points and test data pairs required to achieve the same level of statistical power in the evaluation of separation and comparative separation, respectively. This work is the first to explore fairness evaluation on comparative judgment test data. It shows the feasibility and the practical benefits of using comparative judgment test data for model evaluations.

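Research Motivation and Objectives

Motivate the need for fairness evaluation in ML when labels are expensive or unreliable.
Introduce comparative separation as a fairness notion built on comparative judgments.
Prove that comparative separation is theoretically equivalent to separation in binary classification.
Develop hypothesis tests and a power analysis for evaluating separation and comparative separation.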

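Proposed Method

Define comparative separation on pairwise comparative judgments between data points.
Establish equivalence in binary classification: a classifier that satisfies comparative separation also satisfies standard separation, and vice versa (Theorem 3.3).
Propose metrics and statistical tests for separation and for comparative separation on pairwise data (TPR and related quantities).
Provide a power analysis showing that, in the binary classification setting, reaching the same statistical power takes roughly twice as many comparative judgment test pairs as separation takes test data points.
Extend the evaluation framework to classification and regression settings through comparative judgments.
Validate the results with simulations and real-world fairness datasets from the software engineering domain.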
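As a rough illustration of the baseline that comparative separation is measured against, the sketch below checks standard separation on labeled test data by comparing TPR and FPR across two groups with a two-proportion z-test. This is an assumption about what such a test might look like, not the paper's procedure: the function names (`conditional_rate`, `two_proportion_p_value`, `separation_test`), the choice of z-test, and the toy data are all hypothetical. The pairwise metric for comparative separation is defined in the paper and is not reproduced here.

```python
from math import sqrt
from statistics import NormalDist


def conditional_rate(y_true, y_pred, group, g, label):
    """Return (P(pred = 1 | y = label, group = g), sample count).

    label=1 gives the TPR of group g, label=0 gives its FPR.
    Assumes at least one sample with this label exists in group g.
    """
    preds = [p for t, p, a in zip(y_true, y_pred, group) if t == label and a == g]
    return sum(preds) / len(preds), len(preds)


def two_proportion_p_value(p1, n1, p2, n2):
    """Two-sided p-value of a pooled two-proportion z-test (hypothetical choice of test)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:  # both rates are identically 0 or 1: no evidence of a gap
        return 1.0
    z = (p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def separation_test(y_true, y_pred, group, alpha=0.05):
    """Test the two null hypotheses 'equal TPR' and 'equal FPR' for groups 0 and 1."""
    results = {}
    for name, label in (("TPR", 1), ("FPR", 0)):
        p0, n0 = conditional_rate(y_true, y_pred, group, 0, label)
        p1, n1 = conditional_rate(y_true, y_pred, group, 1, label)
        p_value = two_proportion_p_value(p0, n0, p1, n1)
        results[name] = {"gap": p0 - p1, "p_value": p_value, "reject": p_value < alpha}
    return results


# Toy data (made up for illustration): 200 samples per group, half of them positive.
y_true = [1, 1, 0, 0] * 100
group = [0] * 200 + [1] * 200
fair_pred = list(y_true)                 # perfect predictions in both groups
biased_pred = y_true[:200] + [0] * 200   # group 1 never receives a positive prediction

fair_result = separation_test(y_true, fair_pred, group)
biased_result = separation_test(y_true, biased_pred, group)
```

In this toy setup the biased predictor opens a TPR gap between the groups while the FPR stays equal, so only the TPR hypothesis is rejected. Separation is violated as soon as either of the two hypotheses is rejected.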

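Experimental Results

Research Questions

RQ1: Is comparative separation equivalent to separation in binary classification?
RQ2: How can we statistically test whether a binary classifier satisfies separation or comparative separation?
RQ3: How many test data points or test data pairs are needed to evaluate separation and comparative separation, respectively, at a desired statistical power?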

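Key Findings

Comparative separation is theoretically equivalent to separation in binary classification (Theorem 3.3).
The statistical tests for separation and for comparative separation each rely on two null hypotheses; at α = 0.05 per hypothesis, the combined Type I error rate across the two is 0.0975.
To reach the same statistical power in binary classification, comparative separation needs roughly twice as many test data pairs as separation needs test data points (Section 3.4.2).
The paper provides power analysis formulas and propositions for estimating the Type II error rate and the required sample size (Propositions 3.4 and 3.5).
Empirical validation on simulations and real-world fairness datasets supports the theoretical results and shows that fairness evaluation with comparative judgments is feasible.
The code and data for the experiments are publicly available on GitHub (https://github.com/hil-se/Comparative_Separation).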
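The 0.0975 figure is consistent with combining two tests at α = 0.05 under an independence assumption: 1 - (1 - 0.05)^2 = 0.0975. The sketch below reproduces that arithmetic and adds a textbook normal-approximation power calculation for a two-sided two-proportion z-test. The independence assumption, the function names, and the power formula are illustrative assumptions; they are not the paper's Propositions 3.4 and 3.5.

```python
from math import sqrt
from statistics import NormalDist


def familywise_error(alpha, n_tests=2):
    """Chance of at least one false rejection across n_tests tests (independence assumed)."""
    return 1.0 - (1.0 - alpha) ** n_tests


def two_proportion_power(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two rates (e.g. group TPRs).

    Textbook normal approximation with an unpooled standard error.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    se = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    if se == 0:
        raise ValueError("rates of exactly 0 or 1 in both groups are not supported")
    shift = abs(p1 - p2) / se
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


def min_n_per_group(p1, p2, target_power=0.8, alpha=0.05):
    """Smallest per-group sample size whose approximate power reaches target_power."""
    if p1 == p2:
        raise ValueError("no sample size can detect a zero gap")
    n = 2
    while two_proportion_power(p1, p2, n, alpha) < target_power:
        n += 1
    return n


print(round(familywise_error(0.05, 2), 4))  # 0.0975
```

A larger gap between the two rates or a larger sample raises the power, which is the trade-off a power analysis quantifies when budgeting test data points or test pairs.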


This review was generated by AI and reviewed by a human editor.