QUICK REVIEW

[Paper Review] How to be Fair and Diverse?

L. Elisa Celis, Amit Deshpande|arXiv (Cornell University)|Oct 23, 2016
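Adversarial Robustness in Machine Learning | References: 14 | Cited by: 39
One-Sentence Summary

This paper proposes P-DPP, a novel algorithmic framework that uses determinantal point processes to jointly optimize geometric diversity and combinatorial fairness, producing subsamples that are representative of the feature space and balanced across protected attributes. Experiments show that P-DPP substantially improves fairness with almost no loss in geometric diversity, and that it balances the two objectives even when the data is biased.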

ABSTRACT

Due to the recent cases of algorithmic bias in data-driven decision-making, machine learning methods are being put under the microscope in order to understand the root cause of these biases and how to correct them. Here, we consider a basic algorithmic task that is central in machine learning: subsampling from a large data set. Subsamples are used both as an end-goal in data summarization (where fairness could either be a legal, political or moral requirement) and to train algorithms (where biases in the samples are often a source of bias in the resulting model). Consequently, there is a growing effort to modify either the subsampling methods or the algorithms themselves in order to ensure fairness. However, in doing so, a question that seems to be overlooked is whether it is possible to produce fair subsamples that are also adequately representative of the feature space of the data set - an important and classic requirement in machine learning. Can diversity and fairness be simultaneously ensured? We start by noting that, in some applications, guaranteeing one does not necessarily guarantee the other, and a new approach is required. Subsequently, we present an algorithmic framework which allows us to produce both fair and diverse samples. Our experimental results on an image summarization task show marked improvements in fairness without compromising feature diversity by much, giving us the best of both the worlds.

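Motivation and Objectives

  • Address the limitation that existing methods optimize either fairness or diversity, but not both at once.
  • Develop a scalable algorithm that ensures fairness across sensitive attributes while preserving diversity in the feature space.
  • Assess whether fairness and diversity can coexist in data subsampling without a significant trade-off.
  • Demonstrate that the proposed method is robust when the underlying data distribution is hidden or biased.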

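Proposed Method

  • Proposes P-DPP, a generalization of k-DPP that enforces exact group-size constraints on the sensitive-attribute groups while preserving geometric diversity.
  • Defines the sampling probability as proportional to the squared volume of the parallelepiped spanned by the selected feature vectors, restricted to subsets that satisfy the preset size for each sensitive-attribute group.
  • Builds on efficient sampling algorithms for k-DPP and extends them to handle a constant number of disjoint partitions (p = O(1)), ensuring polynomial-time feasibility.
  • Integrates fairness constraints through fixed counts per sensitive-attribute group (|S ∩ X_i| = k_i), ensuring balanced representation.
  • Uses Shannon entropy as the measure of combinatorial diversity (D(⋅)) and the determinant of the Gram matrix as a proxy for geometric diversity (G(⋅)).
  • Applies the framework to image datasets annotated with sensitive attributes (e.g., gender, occupation) and compares it against uniform sampling, k-DPP, and k_i-DPP.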
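
The distribution described above can be illustrated on a toy problem. The following is a minimal sketch, not the paper's efficient sampler: it assumes the data set is small enough to enumerate every subset that meets the per-group counts, and weights each one by the determinant of its Gram matrix (the squared parallelepiped volume). The function names `pdpp_distribution` and `sample_pdpp`, and the `groups` / `counts` arguments, are hypothetical names chosen for this illustration.

```python
import itertools
import numpy as np


def pdpp_distribution(V, groups, counts):
    """Brute-force the constrained distribution on a tiny data set.

    V      : (n, d) array of feature vectors (one row per item).
    groups : length-n sequence giving each item's sensitive-attribute group.
    counts : dict mapping group label -> required number of items k_i.
    Returns (subsets, probs); probs[j] is proportional to det(V_S V_S^T).
    NOTE: exponential-time enumeration, for illustration only.
    """
    V = np.asarray(V, dtype=float)
    groups = np.asarray(groups)
    # All ways to pick exactly k_i items from each group X_i.
    per_group = []
    for label, k in counts.items():
        members = np.flatnonzero(groups == label)
        per_group.append(list(itertools.combinations(members, k)))

    subsets, weights = [], []
    for parts in itertools.product(*per_group):
        S = tuple(sorted(int(i) for part in parts for i in part))
        V_S = V[list(S)]
        # Gram determinant = squared volume spanned by the rows of V_S.
        # Clip tiny negative values caused by floating-point round-off.
        weights.append(max(float(np.linalg.det(V_S @ V_S.T)), 0.0))
        subsets.append(S)

    weights = np.array(weights)
    total = weights.sum()
    if total <= 0:
        raise ValueError("every feasible subset has zero volume")
    return subsets, weights / total


def sample_pdpp(V, groups, counts, rng):
    """Draw one subset from the enumerated distribution."""
    subsets, probs = pdpp_distribution(V, groups, counts)
    return subsets[rng.choice(len(subsets), p=probs)]


if __name__ == "__main__":
    V = [[1, 0], [0, 1], [1, 0], [0, 1]]
    groups = [0, 0, 1, 1]
    rng = np.random.default_rng(0)
    print(sample_pdpp(V, groups, {0: 1, 1: 1}, rng))
```

In this toy input, items 0 and 2 share a feature vector, as do items 1 and 3, so subsets that pair duplicates span zero volume and are never drawn, while every sampled subset still contains exactly one item from each group.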

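Experimental Results

Research Questions

  • RQ1: Can fairness and geometric diversity be achieved simultaneously in data subsampling, or is there an inherent trade-off?
  • RQ2: How does imposing group-level fairness constraints affect the geometric diversity of the selected sample?
  • RQ3: How does the proposed P-DPP method perform relative to baselines (uniform sampling, k-DPP, k_i-DPP) when the data distribution is hidden or biased?
  • RQ4: Does the method remain robust when sensitive attributes are not fully observed or the data is imbalanced?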

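Key Findings

  • Across all experiments, P-DPP significantly outperforms k-DPP, uniform sampling, and k_i-DPP on fairness (D(⋅)), with paired t-test p-values below 0.05.
  • P-DPP retains geometric diversity (G(⋅)) comparable to k-DPP and significantly higher than uniform sampling, with no noticeable degradation in feature-space coverage.
  • When some attributes are hidden, P-DPP achieves better fairness than k_i-DPP while keeping geometric diversity high; k_i-DPP imposes only partial constraints, and its fairness remains poorer.
  • On heavily skewed datasets (male images making up 10–50% of the data), P-DPP maintains high fairness (D(⋅)), whereas the fairness of k-DPP drops sharply, indicating that P-DPP is robust to data bias.
  • As the size of the smaller group grows, the gap in geometric diversity between P-DPP and k-DPP narrows, indicating that the trade-off weakens when the data provides better coverage.
  • Overall, P-DPP achieves the best balance between fairness and diversity, showing that the two objectives can be optimized together effectively.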
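
The two measures reported in the findings, D(⋅) and G(⋅), can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the entropy is taken in base 2 (the summary does not state the base), G(⋅) is taken to be the raw Gram determinant without any normalization, and the function names are hypothetical.

```python
import math
from collections import Counter

import numpy as np


def combinatorial_diversity(labels):
    """D(S): Shannon entropy (base 2, an assumption) of the group
    proportions among the selected items."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def geometric_diversity(V_S):
    """G(S): determinant of the Gram matrix V_S V_S^T, i.e. the squared
    volume spanned by the selected feature vectors (rows of V_S)."""
    V_S = np.asarray(V_S, dtype=float)
    return float(np.linalg.det(V_S @ V_S.T))


if __name__ == "__main__":
    # A balanced sample scores higher on D than a skewed one.
    print(combinatorial_diversity(["m", "f", "m", "f"]))
    print(combinatorial_diversity(["m", "m", "m", "f"]))
    # Orthogonal vectors span more volume than near-duplicates.
    print(geometric_diversity([[1, 0], [0, 1]]))
    print(geometric_diversity([[1, 0], [0.9, 0.1]]))
```

With two groups, D(⋅) is maximal for an even split and falls to zero when only one group is present, which is why it serves as the fairness score in the comparisons above.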


This review was generated by AI and reviewed by human editors.