QUICK REVIEW

[Paper Review] Improved Approximation and Scalability for Fair Max-Min Diversification

Raghavendra Addanki, Andrew McGregor|arXiv (Cornell University)|Jan 1, 2022

Privacy-Preserving Technologies in Data | Cited by 6

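One-Sentence Summary

This paper presents improved approximation algorithms for the fair Max-Min diversification problem in general metric and Euclidean spaces: a $2$-approximation in general metric spaces (with the fairness constraints satisfied in expectation) and a $1+\epsilon$ approximation in Euclidean spaces, with near-optimal fairness. It combines randomized rounding, coreset constructions, and streaming and distributed algorithms, improves substantially on the previous $3m-1$ approximation, and scales to large data sets.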

ABSTRACT

Given an $n$-point metric space $(\mathcal{X},d)$ where each point belongs to one of $m=O(1)$ different categories or groups and a set of integers $k_1, \ldots, k_m$, the fair Max-Min diversification problem is to select $k_i$ points belonging to category $i\in [m]$, such that the minimum pairwise distance between selected points is maximized. The problem was introduced by Moumoulidou et al. [ICDT 2021] and is motivated by the need to down-sample large data sets in various applications so that the derived sample achieves a balance over diversity, i.e., the minimum distance between a pair of selected points, and fairness, i.e., ensuring enough points of each category are included. We prove the following results: 1. We first consider general metric spaces. We present a randomized polynomial time algorithm that returns a factor $2$-approximation to the diversity but only satisfies the fairness constraints in expectation. Building upon this result, we present a $6$-approximation that is guaranteed to satisfy the fairness constraints up to a factor $1-ε$ for any constant $ε$. We also present a linear time algorithm returning an $m+1$ approximation with exact fairness. The best previous result was a $3m-1$ approximation. 2. We then focus on Euclidean metrics. We first show that the problem can be solved exactly in one dimension. For constant dimensions, categories and any constant $ε>0$, we present a $1+ε$ approximation algorithm that runs in $O(nk) + 2^{O(k)}$ time where $k=k_1+\ldots+k_m$. We can improve the running time to $O(nk)+ poly(k)$ at the expense of only picking $(1-ε) k_i$ points from category $i\in [m]$. Finally, we present algorithms suitable to processing massive data sets including single-pass data stream algorithms and composable coresets for the distributed processing.
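
To make the problem statement concrete, the following minimal sketch evaluates the two quantities the abstract defines: diversity (the minimum pairwise distance of the selection) and fairness (exactly $k_i$ points from category $i$). It is a brute-force reference for tiny inputs only, not one of the paper's algorithms. The function names, the dictionary `quotas` standing for $k_1, \ldots, k_m$, and the use of Euclidean distance are assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def diversity(points):
    """Minimum pairwise Euclidean distance (needs at least two points)."""
    return min(dist(p, q) for p, q in combinations(points, 2))


def is_fair(selected_groups, quotas):
    """True if exactly quotas[g] points were selected from every group g."""
    counts = {}
    for g in selected_groups:
        counts[g] = counts.get(g, 0) + 1
    groups_seen = set(counts) | set(quotas)
    return all(counts.get(g, 0) == quotas.get(g, 0) for g in groups_seen)


def brute_force_fair_max_min(points, groups, quotas):
    """Try every subset of size k = sum of quotas; keep the best fair one.

    Hypothetical helper for illustration: exponential in k, so it is only
    usable on very small instances. Assumes k >= 2.
    """
    k = sum(quotas.values())
    best_value, best_subset = float("-inf"), None
    for subset in combinations(range(len(points)), k):
        if not is_fair([groups[i] for i in subset], quotas):
            continue
        value = diversity([points[i] for i in subset])
        if value > best_value:
            best_value, best_subset = value, subset
    return best_value, best_subset
```

For example, with four collinear points at x = 0, 1, 4, 9, the first two in one group and the last two in another, and one point required from each group, the search selects the points at 0 and 9, whose distance is 9.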

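Motivation and Objectives

Address the need for balanced, diverse sampling from large multi-category data sets.
Improve the approximation factor for fair Max-Min diversification beyond the previous $3m-1$ bound.
Scale to massive data sets through streaming and distributed algorithms.
Achieve a near-optimal $(1+\epsilon)$-approximation in Euclidean spaces with bounded doubling dimension.
Build composable coresets and data stream algorithms that support efficient distributed and single-pass processing.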

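Proposed Methods

Randomized rounding of a linear program, giving a $2$-approximation in general metric spaces with fairness satisfied in expectation.
A $6$-approximation algorithm that, under the mild condition $k_i = \Omega(\epsilon^{-2} \log m)$, selects at least $(1-\epsilon)k_i$ points from each group $i$ with high probability.
A linear-time $(m+1)$-approximation algorithm with exact fairness, a substantial improvement over the previous $3m-1$ result.
A new coreset construction for Euclidean metrics, based on geometric clustering and grid decomposition.
A $(1+\epsilon)$-approximation algorithm for constant-dimensional Euclidean spaces that runs dynamic programming on the coreset.
Composable coresets and single-pass data stream algorithms, built from threshold-based greedy selection ($\tau$-GMM) and coreset reuse.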
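
The threshold-based greedy selection mentioned in the last item can be illustrated with a short sketch. This digest does not define $\tau$-GMM, so the code below assumes it means "accept a point only if it is at least $\tau$ away from every point already kept, and stop after $k$ points". The classic farthest-first greedy (GMM) is shown next to it for comparison. The function names, the `tau` parameter, and the default Euclidean `distance` are illustrative assumptions, and neither function handles fairness constraints.

```python
from math import dist


def threshold_greedy(points, k, tau, distance=dist):
    """One pass over the points: keep a point if it is >= tau from all kept.

    Assumed reading of threshold-based greedy selection, not the paper's
    exact procedure. Returns the indices of the kept points (at most k).
    """
    kept = []
    for i, p in enumerate(points):
        if all(distance(p, points[j]) >= tau for j in kept):
            kept.append(i)
            if len(kept) == k:
                break
    return kept


def gmm(points, k, distance=dist):
    """Farthest-first traversal started from the first point (non-empty input)."""
    chosen = [0]
    # nearest[i] is the distance from point i to its closest chosen point.
    nearest = [distance(p, points[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(nxt)
        nearest = [min(d, distance(p, points[nxt])) for d, p in zip(nearest, points)]
    return chosen
```

The thresholded variant needs only the kept points in memory, which is why this style of selection suits single-pass processing; the price is that a suitable `tau` has to be guessed, typically by trying several values.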

Experimental Results

Research Questions

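RQ1: In general metric spaces, can fair Max-Min diversification be approximated within a factor better than $3m-1$?
RQ2: In Euclidean spaces, can a $(1+\epsilon)$-approximation be achieved with approximate fairness and an efficient running time?
RQ3: Can scalable algorithms be designed for massive data sets, supporting single-pass processing and distributed computation?
RQ4: What is the best trade-off between approximation quality and fairness relaxation in fair diversification?
RQ5: Can composable coresets be built that preserve the fairness and approximation guarantees under distributed processing?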

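Key Findings

The paper gives a $2$-approximation algorithm for general metric spaces whose fairness constraints hold in expectation, improving on the previous $3m-1$ bound.
Under the condition $k_i = \Omega(\epsilon^{-2} \log m)$, it gives a $6$-approximation that selects at least $(1-\epsilon)k_i$ points from each group $i$ with high probability.
It gives a linear-time $(m+1)$-approximation algorithm with exact fairness, a substantial improvement over the previous $3m-1$ result.
For constant-dimensional Euclidean spaces, the algorithm achieves a $(1+\epsilon)$-approximation in $O(nk) + 2^{O(k)}$ time; a faster $O(nk) + poly(k)$ variant selects only $(1-\epsilon)k_i$ points from each category $i$.
For Euclidean metrics with doubling dimension $\lambda$, it constructs a $(1+\epsilon)$-composable coreset of size $O((8/\epsilon)^{\lambda} k m L)$, supporting distributed processing.
It designs single-pass data stream algorithms that use $O(\epsilon^{-1} k m \log n)$ space in general metric spaces and $O((8/\epsilon)^{\lambda} k m \epsilon^{-1} \log n)$ space in Euclidean spaces, achieving $30(1+\epsilon)$- and $(1+\epsilon)$-approximations, respectively.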
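
The composable-coreset idea behind the distributed result can be sketched generically: each machine summarizes its own partition, and only the union of the summaries is passed to a final solver. The code below is a simplified illustration of that pattern, not the paper's construction, and it does not reproduce the size bound or the $(1+\epsilon)$ guarantee stated above. It assumes each machine keeps up to `k` farthest-first points per group; `farthest_first`, `local_summary`, and `merge_summaries` are hypothetical names.

```python
from math import dist


def farthest_first(points, k):
    """Return up to k points chosen by farthest-first traversal."""
    if not points:
        return []
    chosen = [points[0]]
    # nearest[i] is the distance from points[i] to its closest chosen point.
    nearest = [dist(p, points[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        i = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(points[i])
        nearest = [min(d, dist(p, points[i])) for d, p in zip(nearest, points)]
    return chosen


def local_summary(partition, k):
    """Per-machine summary: up to k farthest-first points from each group.

    `partition` is a list of (point, group) pairs held by one machine.
    """
    by_group = {}
    for point, group in partition:
        by_group.setdefault(group, []).append(point)
    return [(p, g) for g, pts in by_group.items() for p in farthest_first(pts, k)]


def merge_summaries(partitions, k):
    """Union of all local summaries; a final solver would see only this union."""
    merged = []
    for partition in partitions:
        merged.extend(local_summary(partition, k))
    return merged
```

With `k = 2`, a machine holding three points of one group forwards only two of them, so the data sent to the coordinator grows with the number of machines, groups, and `k` rather than with $n$.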


This review was generated by AI and checked by a human editor.