QUICK REVIEW

[Paper Review] Data Shapley: Equitable Valuation of Data for Machine Learning

Amirata Ghorbani, James Zou | arXiv (Cornell University) | Apr 5, 2019

Explainable Artificial Intelligence (XAI) | Cited by 152

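One-Sentence Summary

Data Shapley assigns each training data point in supervised learning an equitable, game-theoretic value; the values are estimated with Monte Carlo methods, and the approach applies across a range of models and tasks.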

ABSTRACT

As data becomes the fuel driving technological and economic growth, a fundamental challenge is how to quantify the value of data in algorithmic predictions and decisions. For example, in healthcare and consumer markets, it has been suggested that individuals should be compensated for the data that they generate, but it is not clear what is an equitable valuation for individual data. In this work, we develop a principled framework to address data valuation in the context of supervised machine learning. Given a learning algorithm trained on $n$ data points to produce a predictor, we propose data Shapley as a metric to quantify the value of each training datum to the predictor performance. Data Shapley value uniquely satisfies several natural properties of equitable data valuation. We develop Monte Carlo and gradient-based methods to efficiently estimate data Shapley values in practical settings where complex learning algorithms, including neural networks, are trained on large datasets. In addition to being equitable, extensive experiments across biomedical, image and synthetic data demonstrate that data Shapley has several other benefits: 1) it is more powerful than the popular leave-one-out or leverage score in providing insight on what data is more valuable for a given learning task; 2) low Shapley value data effectively capture outliers and corruptions; 3) high Shapley value data inform what type of new data to acquire to improve the predictor.

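Research Motivation and Objectives

Motivate the need for an equitable data valuation framework in supervised learning.
Define Data Shapley as the equitable value of each training data point relative to a learning algorithm and a performance metric.
Propose computational methods for estimating Data Shapley values in practical settings.
Demonstrate applications of Data Shapley to data quality assessment, domain adaptation, and data acquisition decisions.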
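
The definition named in the objectives can be written out explicitly. The block below is the standard Shapley form applied to training data, given here as a sketch: this review does not reproduce the paper's equation, so the notation is an assumption. $D$ is the training set of $n$ points, $V(S)$ is shorthand for the performance $V(S, A)$ of algorithm $A$ trained on subset $S$, and $C$ is an arbitrary positive constant.

```latex
% Value of training point i: its marginal contribution to performance,
% averaged over all subsets S of the remaining n - 1 points.
\phi_i \;=\; C \sum_{S \subseteq D \setminus \{i\}}
    \frac{V(S \cup \{i\}) - V(S)}{\binom{n-1}{|S|}}

% Equivalent permutation form with C = 1/n, where \Pi is the uniform
% distribution over orderings of D and S_\pi^i is the set of points
% that come before i in ordering \pi.
\phi_i \;=\; \mathbb{E}_{\pi \sim \Pi}
    \left[ V\!\left(S_\pi^i \cup \{i\}\right) - V\!\left(S_\pi^i\right) \right]
```

The permutation form is the one that makes sampling-based estimation natural: drawing random orderings and averaging the marginal contributions approximates the expectation.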

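Proposed Method

Formulate data valuation as a cooperative game in which the training data sources are the players and the payoff is the model performance V(D, A).
Derive the Data Shapley value as the unique value assignment satisfying three properties: zero value for data points that never change performance, symmetry for points with equal contributions, and additivity over performance scores (Eqn. 1).
Estimate each data point's marginal contribution to V by Monte Carlo sampling over random permutations, which yields an estimator of the Shapley value.
Introduce Truncated Monte Carlo Shapley (TMC-Shapley), which reduces computation by skipping negligible marginal contributions during the scan of each permutation.
Provide a second approximation method tailored to specific learning algorithms (details in Appendix B).
Discuss applications of Data Shapley to identifying data quality issues, guiding domain adaptation through a weighted loss, and informing data acquisition decisions.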
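
The permutation sampling and truncation steps listed above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. The function name `tmc_shapley`, the `utility` callback (a stand-in for training on a subset and scoring it, i.e. V(S, A)), the parameter names, and the toy additive utility are all assumptions made for this example.

```python
import random
from typing import Callable, List, Sequence


def tmc_shapley(
    n_points: int,
    utility: Callable[[Sequence[int]], float],
    n_permutations: int = 200,
    tolerance: float = 0.01,
    seed: int = 0,
) -> List[float]:
    """Estimate one value per training point by truncated permutation sampling.

    `utility(indices)` is a hypothetical callback standing in for V(S, A):
    train on the points in `indices` and return the performance score.
    """
    rng = random.Random(seed)
    full_score = utility(list(range(n_points)))  # performance on all points
    empty_score = utility([])                    # performance with no data
    values = [0.0] * n_points

    for t in range(1, n_permutations + 1):
        order = list(range(n_points))
        rng.shuffle(order)
        prefix: List[int] = []
        prev_score = empty_score
        for idx in order:
            if abs(full_score - prev_score) < tolerance:
                # Truncation: the prefix already scores like the full set,
                # so the remaining marginal contributions are taken as zero.
                marginal = 0.0
            else:
                prefix.append(idx)
                new_score = utility(prefix)
                marginal = new_score - prev_score
                prev_score = new_score
            # Running mean of this point's marginal contribution.
            values[idx] += (marginal - values[idx]) / t
    return values


# Toy check (assumed example): in an additive game the marginal contribution
# of a point is always its own weight, so the estimate matches the weights.
weights = [1.0, 2.0, 3.0, -1.0]


def additive_utility(indices: Sequence[int]) -> float:
    return sum(weights[i] for i in indices)


estimates = tmc_shapley(len(weights), additive_utility,
                        n_permutations=50, tolerance=0.0)
```

In a real setting `utility` would retrain the model on each prefix, which is where the cost lies. Truncation helps because later points in a permutation tend to change the score very little once the prefix is large.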

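Experimental Results

Research Questions

RQ1: In supervised learning, what is the equitable statistical value of each training data point for a chosen performance metric?
RQ2: How can Data Shapley values be estimated efficiently for large datasets and complex models?
RQ3: Can Data Shapley values reveal data quality, aid domain adaptation, and guide data acquisition?
RQ4: How does Data Shapley compare with leave-one-out or leverage-based measures at identifying valuable or harmful data?
RQ5: What are the practical implications and limitations of applying Data Shapley to real-world biomedical and image datasets?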

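Key Findings

Data Shapley provides an equitable valuation framework for training data that satisfies three natural fairness properties.
In the experiments, Data Shapley identifies valuable data more effectively than leave-one-out or leverage scores.
Low-Shapley-value data tend to capture outliers or corrupted data, while high-Shapley-value data indicate what helps improve the predictor.
Data Shapley can guide data acquisition by prioritizing samples similar to high-value data, and it can be used to reweight training data for domain adaptation.
The framework supports practical applications, including healthcare data valuation, image data quality assessment, and cross-center domain adaptation.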
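
A small toy game shows one way a Shapley-style value can be more informative than leave-one-out. It is an illustration built for this review, not one of the paper's experiments. The helper names and the `duplicate_utility` game are hypothetical: points 0 and 1 are exact duplicates of a useful example, and point 2 is uninformative.

```python
from itertools import permutations
from typing import Callable, List, Sequence

Utility = Callable[[Sequence[int]], float]


def leave_one_out(n_points: int, utility: Utility) -> List[float]:
    """Value of point i = full-set score minus the score without point i."""
    everyone = list(range(n_points))
    full_score = utility(everyone)
    return [
        full_score - utility([j for j in everyone if j != i])
        for i in range(n_points)
    ]


def exact_shapley(n_points: int, utility: Utility) -> List[float]:
    """Average marginal contribution over every ordering (tiny n only)."""
    totals = [0.0] * n_points
    n_orders = 0
    for order in permutations(range(n_points)):
        prefix: List[int] = []
        prev_score = utility(prefix)
        for idx in order:
            prefix.append(idx)
            score = utility(prefix)
            totals[idx] += score - prev_score
            prev_score = score
        n_orders += 1
    return [total / n_orders for total in totals]


def duplicate_utility(indices: Sequence[int]) -> float:
    # Assumed toy game: the model performs well (score 1.0) as soon as it
    # has seen either copy of the useful example; point 2 never helps.
    return 1.0 if (0 in indices or 1 in indices) else 0.0


loo_values = leave_one_out(3, duplicate_utility)      # [0.0, 0.0, 0.0]
shapley_values = exact_shapley(3, duplicate_utility)  # [0.5, 0.5, 0.0]
```

Leave-one-out gives every point zero, because removing one duplicate leaves the other in place. The Shapley average splits the credit between the two duplicates and still gives zero to the uninformative point.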

This review was generated by AI and reviewed by human editors.