QUICK REVIEW

[Paper Review] Fairness and Missing Values

Fernando Martínez‐Plumed, Cèsar Ferri|arXiv (Cornell University)|May 29, 2019

Ethics and Social Impacts of AI|Cited by 6

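One-Sentence Summary

This paper argues that missing data is closely tied to fairness in machine learning, and challenges the common practice of deleting or imputing missing values without considering the effect on fairness. It finds that rows containing missing values are usually fairer than complete rows, and that imputation achieves a better fairness-performance trade-off than deletion for models such as random forests.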

ABSTRACT

The causes underlying unfair decision making are complex, being internalised in different ways by decision makers, other actors dealing with data and models, and ultimately by the individuals being affected by these decisions. One frequent manifestation of all these latent causes arises in the form of missing values: protected groups are more reluctant to give information that could be used against them, delicate information for some groups can be erased by human operators, or data acquisition may simply be less complete and systematic for minority groups. As a result, missing values and bias in data are two phenomena that are tightly coupled. However, most recent techniques, libraries and experimental results dealing with fairness in machine learning have simply ignored missing data. In this paper, we claim that fairness research should not miss the opportunity to deal properly with missing data. To support this claim, (1) we analyse the sources of missing data and bias, and we map the common causes, (2) we find that rows containing missing values are usually fairer than the rest, which should not be treated as the uncomfortable ugly data that different techniques and libraries get rid of at the first occasion, and (3) we study the trade-off between performance and fairness when the rows with missing values are used (either because the technique deals with them directly or by imputation methods). We end the paper with a series of recommended procedures about what to do with missing data when aiming for fair decision making.

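Research Motivation and Objectives

Investigate the relationship between missing data and algorithmic fairness.
Challenge the default practice of deleting or imputing missing values without considering fairness.
Assess whether rows with missing values are more or less fair than complete rows.
Analyze the trade-off between fairness and performance under imputation and deletion strategies.
Provide actionable recommendations for handling missing data in fairness-sensitive machine learning.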

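Proposed Method

Analyze three real-world datasets with known fairness issues and missing values (Adult, Recidivism, Titanic).
Map the sources of missing data to the root causes of bias, including privacy concerns and systematic under-representation.
Use Statistical Parity Difference (SPD) as the primary fairness metric to compare fairness across data subsets.
Apply several models (DT, LR, NN, RF, SV) to the datasets after deletion and after imputation to assess the fairness-performance trade-off.
Build Pareto fronts to visualize the trade-off between accuracy and fairness under different imputation and deletion strategies.
Derive a theoretical bounding octagon for the fairness-performance space in which to situate the empirical results.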
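
The subset comparison based on SPD can be illustrated with a minimal sketch. It assumes the commonly used definition SPD = P(outcome = 1 | unprivileged) - P(outcome = 1 | privileged), where 0 means parity; the paper's exact sign convention is not stated in this summary. The function names, column names, and the eight toy rows are hypothetical and hand-made for illustration; they are not taken from the Adult, Recidivism, or Titanic datasets.

```python
def statistical_parity_difference(rows, outcome, protected, privileged):
    """Assumed definition: P(outcome=1 | unprivileged) - P(outcome=1 | privileged)."""
    priv = [r[outcome] for r in rows if r[protected] == privileged]
    unpriv = [r[outcome] for r in rows if r[protected] != privileged]
    if not priv or not unpriv:
        raise ValueError("both groups must be non-empty")
    return sum(unpriv) / len(unpriv) - sum(priv) / len(priv)


def split_by_missingness(rows):
    """Split rows into (rows with at least one missing value, complete rows).

    A missing value is represented by None in this sketch.
    """
    with_missing = [r for r in rows if any(v is None for v in r.values())]
    complete = [r for r in rows if all(v is not None for v in r.values())]
    return with_missing, complete


# Hypothetical toy table: "sex" is the protected attribute, "M" the privileged
# value, and "income_high" the binary favorable outcome.
toy = [
    {"sex": "M", "occupation": "a",  "income_high": 1},
    {"sex": "M", "occupation": None, "income_high": 1},
    {"sex": "M", "occupation": "b",  "income_high": 1},
    {"sex": "M", "occupation": "c",  "income_high": 0},
    {"sex": "F", "occupation": None, "income_high": 1},
    {"sex": "F", "occupation": "d",  "income_high": 0},
    {"sex": "F", "occupation": None, "income_high": 0},
    {"sex": "F", "occupation": "e",  "income_high": 0},
]

with_missing, complete = split_by_missingness(toy)
spd_missing = statistical_parity_difference(with_missing, "income_high", "sex", "M")
spd_complete = statistical_parity_difference(complete, "income_high", "sex", "M")
print(f"SPD, rows with missing values: {spd_missing:.3f}")
print(f"SPD, complete rows:            {spd_complete:.3f}")
```

Comparing the absolute values of the two numbers shows which subset is closer to parity. Dropping the `with_missing` rows before training corresponds to the deletion strategy.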

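Experimental Results

Research Questions

RQ1: Are missing values and fairness causally linked? If so, through what mechanisms?
RQ2: Do rows with missing values show higher or lower fairness than complete rows?
RQ3: Does deleting rows with missing values amplify bias? Does imputation mitigate bias or make it worse?
RQ4: How do different imputation methods affect the fairness-performance trade-off in predictive models?
RQ5: What are the recommended practices for handling missing data in fairness-critical machine learning applications?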

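Key Findings

On the Adult, Recidivism, and Titanic datasets, rows with missing values are consistently fairer than complete rows as measured by Statistical Parity Difference (SPD).
Deleting rows with missing values systematically worsens fairness, most noticeably in datasets where values are not missing at random (for example, where missingness correlates with the protected attribute).
Compared with deletion, imputation generally preserves or improves fairness, with random forests showing the most favorable trade-off between accuracy and fairness.
On the Adult dataset, every imputation method reduced bias relative to the ideal model, suggesting that imputation helps mitigate unfairness.
The Pareto front built from imputed data dominates the one built from data after deletion, indicating that imputation offers a wider range of feasible fairness-performance compromises.
Random forests trace a nearly linear path in the fairness-performance space from the deletion setting to the ideal model, suggesting strong robustness and stability when imputation is used.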
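
A Pareto front over accuracy and bias can be sketched as follows. This is a generic non-dominated filter, not the authors' code: it assumes that higher accuracy and lower absolute SPD are both preferred. The configuration labels and numbers are hypothetical placeholders, not results from the paper.

```python
def pareto_front(points):
    """Return the non-dominated (label, accuracy, abs_spd) tuples, sorted by accuracy.

    A point is dominated if another point is at least as accurate and at least
    as fair (lower abs_spd), and strictly better on one of the two.
    """
    front = []
    for label, acc, bias in points:
        dominated = any(
            (a >= acc and b <= bias) and (a > acc or b < bias)
            for _, a, b in points
        )
        if not dominated:
            front.append((label, acc, bias))
    return sorted(front, key=lambda p: p[1])


# Hypothetical (label, accuracy, |SPD|) values for five model/strategy pairs.
candidates = [
    ("A", 0.80, 0.20),
    ("B", 0.85, 0.15),
    ("C", 0.82, 0.10),
    ("D", 0.78, 0.12),
    ("E", 0.85, 0.18),
]

front = pareto_front(candidates)
for label, acc, bias in front:
    print(f"{label}: accuracy={acc:.2f}, |SPD|={bias:.2f}")
```

Running the filter separately on the configurations trained after deletion and on those trained after imputation gives two fronts that can then be compared in the same accuracy-fairness plane.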


This review was generated by AI and checked by a human editor.