QUICK REVIEW

[Paper Digest] Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms

Ruoxi Jia, David Dao|arXiv (Cornell University)|Jul 1, 2019

Topic: Advanced Image and Video Retrieval Techniques|References: 13|Cited by: 8

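One-Sentence Summary

This paper presents exact and approximate algorithms for computing Shapley-value-based data valuations for K-nearest-neighbor (KNN) models: the exact algorithm runs in O(N log N) time, and an approximation accelerated by locality-sensitive hashing (LSH) runs in sublinear O(N^{h(ϵ,K)} log N) time. The key contribution is an exponential reduction in computational complexity relative to the baseline, which makes efficient, fair data valuation practical at scale; the algorithms are evaluated on benchmark datasets of up to 10 million data points.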

ABSTRACT

Given a data set $\mathcal{D}$ containing millions of data points and a data consumer who is willing to pay for \$$X$ to train a machine learning (ML) model over $\mathcal{D}$, how should we distribute this \$$X$ to each data point to reflect its "value"? In this paper, we define the "relative value of data" via the Shapley value, as it uniquely possesses properties with appealing real-world interpretations, such as fairness, rationality and decentralizability. For general, bounded utility functions, the Shapley value is known to be challenging to compute: to get Shapley values for all $N$ data points, it requires $O(2^N)$ model evaluations for exact computation and $O(N\log N)$ for $(ε, δ)$-approximation. In this paper, we focus on one popular family of ML models relying on $K$-nearest neighbors ($K$NN). The most surprising result is that for unweighted $K$NN classifiers and regressors, the Shapley value of all $N$ data points can be computed, exactly, in $O(N\log N)$ time -- an exponential improvement on computational complexity! Moreover, for $(ε, δ)$-approximation, we are able to develop an algorithm based on Locality Sensitive Hashing (LSH) with only sublinear complexity $O(N^{h(ε,K)}\log N)$ when $ε$ is not too small and $K$ is not too large. We empirically evaluate our algorithms on up to $10$ million data points and even our exact algorithm is up to three orders of magnitude faster than the baseline approximation algorithm. The LSH-based approximation algorithm can accelerate the value calculation process even further. We then extend our algorithms to other scenarios such as (1) weighed $K$NN classifiers, (2) different data points are clustered by different data curators, and (3) there are data analysts providing computation who also requires proper valuation.
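For reference, the Shapley value that the abstract uses to define the "relative value of data" can be written out explicitly. The block below is the standard textbook definition together with one common form of the unweighted KNN utility for a single test point; the symbols $z_i$, $U$, $s_i$, $\alpha_k(S)$ and $y_{\text{test}}$ are notation assumed here for illustration and do not appear in the text above.

```latex
% Shapley value of data point z_i under a utility function U over subsets of D, |D| = N
s_i \;=\; \frac{1}{N} \sum_{S \subseteq \mathcal{D} \setminus \{z_i\}}
      \frac{1}{\binom{N-1}{|S|}} \bigl[\, U(S \cup \{z_i\}) - U(S) \,\bigr]

% Assumed unweighted KNN classification utility for one test point (x_test, y_test),
% where \alpha_k(S) is the index of the k-th nearest neighbor of x_test within S
U(S) \;=\; \frac{1}{K} \sum_{k=1}^{\min\{K,\,|S|\}} \mathbb{1}\bigl[\, y_{\alpha_k(S)} = y_{\text{test}} \,\bigr]
```

Evaluating the first formula directly requires the utility of every subset, which is where the $O(2^N)$ cost quoted in the abstract comes from.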

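Motivation and Objectives

Address the challenge of fair, scalable data valuation in large-scale machine learning marketplaces, where data contributors expect a share of the revenue.
Overcome the exponential cost of computing exact Shapley values for the utility functions of KNN models.
Develop practical, efficient algorithms with provable guarantees for computing data values for unweighted and weighted KNN classifiers and regressors.
Extend data valuation to settings where each contributor supplies multiple data points and where the computation contributed by data analysts must also be valued.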

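Proposed Method

Define data value via the Shapley value (SV), which distributes revenue in a way that is fair, rational, and decentralizable.
Give an exact O(N log N) algorithm for unweighted KNN classifiers that exploits the geometry of nearest neighbors and a sorting-based aggregation.
Give an LSH-based approximation with sublinear complexity O(N^{h(ϵ,K)} log N), suited to large datasets, where h(ϵ,K) < 1 when K* = max{1/ϵ, K} < C (that is, when ϵ is not too small and K is not too large).
Extend the framework to weighted KNN, to contributors who each supply multiple data points, and to the valuation of computation contributions, using Monte Carlo approximation.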
Introduce a new Monte Carlo approximation that is faster than the baseline sampling method by a factor of about O(N (log N)^2 / (log K)^2).
Exploit locality and symmetry in KNN to cut redundant utility evaluations and estimate marginal contributions efficiently.
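The sorting-based exact computation listed above can be sketched as follows. This is a minimal illustration rather than the authors' code: it handles a single test point, assumes the utility is the fraction of the K nearest neighbors whose label matches the test label, and assumes 1 <= K <= N. The function name knn_shapley_single and its parameters are hypothetical. The recursion runs from the farthest training point to the nearest one, so the cost is dominated by the sort.

```python
import math


def knn_shapley_single(x_train, y_train, x_test, y_test, k):
    """Exact Shapley values of all training points for ONE test point.

    Assumed utility (not stated in the digest): the fraction of the k nearest
    neighbors whose label equals y_test. Assumes 1 <= k <= len(y_train).
    """
    n = len(y_train)
    if not 1 <= k <= n:
        raise ValueError("this sketch assumes 1 <= k <= n")

    # Rank training indices by Euclidean distance to the test point: O(N log N).
    order = sorted(range(n), key=lambda j: math.dist(x_train[j], x_test))
    match = [1.0 if y_train[j] == y_test else 0.0 for j in order]

    # Recursion over the sorted points, farthest first: O(N).
    s = [0.0] * n
    s[n - 1] = match[n - 1] / n
    for i in range(n - 2, -1, -1):
        rank = i + 1  # 1-based rank of the point at sorted position i
        s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, rank) / rank

    # Map values from sorted order back to the original training indices.
    values = [0.0] * n
    for pos, j in enumerate(order):
        values[j] = s[pos]
    return values


# Usage: six 2-D training points, one test point, K = 2.
x = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0), (-1.5, 0.5), (2.0, 2.0)]
y = [1, 0, 1, 1, 0, 0]
values = knn_shapley_single(x, y, (0.2, 0.1), 1, k=2)
```

For several test points, one option is to average the per-test-point values, which costs one sort per test point. A quick sanity check is that the values sum to the utility of the full training set, as the efficiency property of the Shapley value requires.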

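Experimental Results

Research Questions

RQ1: Can the Shapley values of all data points in a KNN model be computed exactly in sub-exponential time?
RQ2: Can LSH give sublinear time complexity for KNN data valuation in the (ϵ, δ)-approximation sense?
RQ3: How can the efficient valuation methods be extended to weighted KNN and to data providers who each contribute multiple data points?
RQ4: Can both data and computation contributions be valued efficiently in a collaborative machine learning setting?
RQ5: How large is the improvement of the proposed algorithms over the baseline approximation method, in theory and in practice?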

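Key Findings

Exact Shapley values for unweighted KNN classifiers can be computed in O(N log N) time, an exponential improvement over the standard O(2^N) complexity.
The LSH-based approximation achieves sublinear complexity O(N^{h(ϵ,K)} log N), which makes datasets of up to 10 million data points tractable.
In the empirical evaluation, the exact algorithm is up to three orders of magnitude faster than the baseline approximation algorithm.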
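For weighted KNN, the exact algorithm runs in O(N^K) time, which is polynomial in N but still exponential in K; the Monte Carlo approximation gives a speedup of about O(N (log N)^2 / (log K)^2) over the baseline.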
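The proposed algorithms cut running time substantially while keeping the theoretical (ϵ, δ) guarantees on approximation error, which makes Shapley-value-based valuation practical at large scale.
Empirically, the LSH-based approximation speeds up the computation further, most noticeably when ϵ is not too small and K is moderate.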
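The idea behind the LSH-based approximation in the findings above can be illustrated with a truncation sketch. It rests on an assumption that is not spelled out in this digest: for a single test point, only the K* = max(K, ceil(1/ϵ)) nearest neighbors need non-zero values, because under the exact recursion for the unweighted KNN utility a point of rank i >= K has a value of magnitude at most 1/i. The sketch below swaps in an exact nearest-neighbor search (heapq.nsmallest) where an LSH index would go, so it shows the truncation step only and is not sublinear. The function name truncated_knn_shapley is hypothetical.

```python
import heapq
import math


def truncated_knn_shapley(x_train, y_train, x_test, y_test, k, eps):
    """Approximate per-test-point KNN Shapley values by truncation.

    Points outside the K* nearest neighbors get value 0. Assumed utility:
    fraction of the k nearest neighbors whose label equals y_test.
    """
    n = len(y_train)
    k_star = max(k, math.ceil(1.0 / eps))
    if k < 1 or k_star > n:
        raise ValueError("sketch assumes k >= 1 and max(k, ceil(1/eps)) <= n")

    # Stand-in for LSH retrieval (assumption): exact search for the K* nearest
    # training points, returned in order of increasing distance.
    nearest = heapq.nsmallest(
        k_star, range(n), key=lambda j: math.dist(x_train[j], x_test)
    )
    match = [1.0 if y_train[j] == y_test else 0.0 for j in nearest]

    # The point of rank K* is assigned 0; closer points follow the same
    # recursion as the exact algorithm, started from that truncated base.
    s_hat = [0.0] * k_star
    for i in range(k_star - 2, -1, -1):
        rank = i + 1  # 1-based rank of the point at sorted position i
        s_hat[i] = s_hat[i + 1] + (match[i] - match[i + 1]) / k * min(k, rank) / rank

    values = [0.0] * n
    for pos, j in enumerate(nearest):
        values[j] = s_hat[pos]
    return values


# Usage: with K = 2 and eps = 0.25, only the K* = 4 nearest points are retrieved.
x = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0), (-1.5, 0.5), (2.0, 2.0)]
y = [1, 0, 1, 1, 0, 0]
approx = truncated_knn_shapley(x, y, (0.2, 0.1), 1, k=2, eps=0.25)
```

Because only K* neighbors per test point are needed, the cost of valuation reduces to the cost of nearest-neighbor retrieval, which is the step an LSH index is meant to accelerate. A smaller ϵ or a larger K raises K*, which is consistent with the finding that the speedup is largest when ϵ is not too small and K is moderate.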

This digest was generated by AI and reviewed by a human editor.