QUICK REVIEW

[Paper Review] Scalable Fair Clustering

Artūrs Bačkurs, Piotr Indyk|arXiv (Cornell University)|Feb 10, 2019

Facility Location and Emergency Management | Cited by 58

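One-Sentence Summary

This paper presents a nearly linear-time algorithm for (r,b)-fair k-median clustering: it first computes a scalable (r,b)-fairlet decomposition via a gamma-HST embedding, then merges the fairlets into k clusters.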

ABSTRACT

We study the fair variant of the classic $k$-median problem introduced by Chierichetti et al. [2017]. In the standard $k$-median problem, given an input pointset $P$, the goal is to find $k$ centers $C$ and assign each input point to one of the centers in $C$ such that the average distance of points to their cluster center is minimized. In the fair variant of $k$-median, the points are colored, and the goal is to minimize the same average distance objective while ensuring that all clusters have an "approximately equal" number of points of each color. Chierichetti et al. proposed a two-phase algorithm for fair $k$-clustering. In the first step, the pointset is partitioned into subsets called fairlets that satisfy the fairness requirement and approximately preserve the $k$-median objective. In the second step, fairlets are merged into $k$ clusters by one of the existing $k$-median algorithms. The running time of this algorithm is dominated by the first step, which takes super-quadratic time. In this paper, we present a practical approximate fairlet decomposition algorithm that runs in nearly linear time. Our algorithm additionally allows for finer control over the balance of resulting clusters than the original work. We complement our theoretical bounds with empirical evaluation.
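The two quantities the abstract refers to, the average-distance objective and the per-cluster color balance, can be made concrete with a small sketch. It assumes two colors labeled "red" and "blue", Euclidean distance, and that a clustering counts as (r,b)-fair when every cluster has balance min(#red/#blue, #blue/#red) of at least b/r with 1 <= b <= r; that threshold is an assumption of this sketch, and all function names are hypothetical.

```python
import math
from collections import Counter


def kmedian_cost(points, centers, assignment):
    """Average Euclidean distance from each point to its assigned center."""
    total = sum(math.dist(p, centers[c]) for p, c in zip(points, assignment))
    return total / len(points)


def cluster_balance(colors, assignment):
    """Smallest value of min(#red/#blue, #blue/#red) over all clusters."""
    counts = {}
    for color, c in zip(colors, assignment):
        counts.setdefault(c, Counter())[color] += 1
    worst = 1.0
    for cnt in counts.values():
        red, blue = cnt["red"], cnt["blue"]
        if red == 0 or blue == 0:
            return 0.0  # a single-colored cluster has balance 0
        worst = min(worst, red / blue, blue / red)
    return worst


def is_rb_fair(colors, assignment, r, b):
    """Assumed criterion: every cluster has balance >= b / r (1 <= b <= r)."""
    return cluster_balance(colors, assignment) >= b / r
```

For example, a cluster with one red and two blue points has balance 1/2, so under the assumed criterion it passes for (r,b) = (2,1) but not for (r,b) = (3,2).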

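Motivation and Goals

Close the scalability gap in fair k-median clustering by designing a nearly linear-time fairlet decomposition.
Develop an embedding-based approach (gamma-HST) that enables efficient fairlet construction.
Provide theoretical guarantees on the approximation factor and demonstrate empirical scalability on standard datasets.
Offer finer-grained control over cluster balance (r,b) than prior work.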

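Proposed Method

Embed the input into a gamma-HST using a construction based on random grids.
Compute an (r,b)-fairlet decomposition on the HST that approximates the fair k-median objective, with cost distortion O(d*(r^8+b^8)*log n).
Use a top-down tree partitioning procedure that minimizes the number of "heavy points", yielding a nearly linear-time fairlet decomposition (procedures MinHeavyPoints, UnbalancedPoints, NonSaturFairlet, ExtraPoint).
Merge the fairlets into k clusters by turning each fairlet into a center and running a beta-approximate k-median algorithm on the replicated centers, which gives an (alpha + (r+b)*beta)-approximate (r,b)-fair k-median clustering.
Theoretical guarantee: the running time is O(d * n * log n + T(n,d,k)) and the cost is within a factor O_r,b(d*log n + alpha) of optimal; with the HST embedding, the overall running time is nearly linear.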
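The random-grid embedding in the first step can be illustrated with a standard randomly shifted grid hierarchy. This is a minimal sketch of the general technique, not the paper's exact construction: cell sides halve at each level (so it corresponds to gamma = 2), and the function names, the `seed` parameter, and the choice of root cell side are all assumptions made for illustration.

```python
import random


def shifted_grid_cells(points, num_levels, seed=0):
    """Return cells[level][i], the grid cell (tuple of ints) of point i.

    Level 0 is the root; each deeper level halves the cell side. All levels
    share one random shift, so the cells are nested and form a tree.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    lo = [min(p[j] for p in points) for j in range(dim)]
    hi = [max(p[j] for p in points) for j in range(dim)]
    delta = max(h - l for h, l in zip(hi, lo)) or 1.0
    # Root side 2 * delta with a shift in [0, delta) keeps every point in one root cell.
    shift = [rng.random() * delta for _ in range(dim)]
    finest_side = 2 * delta / 2 ** (num_levels - 1)
    finest = [
        tuple(int((p[j] - lo[j] + shift[j]) // finest_side) for j in range(dim))
        for p in points
    ]
    cells = []
    for level in range(num_levels):
        drop = num_levels - 1 - level  # merge 2**drop finest cells per axis
        cells.append([tuple(c >> drop for c in cell) for cell in finest])
    return cells


def split_level(cells, i, j):
    """First level at which points i and j land in different cells, else None."""
    for level, row in enumerate(cells):
        if row[i] != row[j]:
            return level
    return None
```

Points that separate at a shallow level are far apart in the tree metric, while points that stay together down to deep levels are close. A fairlet decomposition that works top-down on such a tree only needs one pass over the nested cells, which is consistent with the nearly linear running time stated above.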

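Experimental Results

Research Questions

RQ1: In Euclidean space, how can scalable (r,b)-fair k-median clustering be achieved without a quadratic-time fairlet computation?
RQ2: Can an embedding-based approach (gamma-HST) achieve a nearly linear-time fairlet decomposition while preserving the fairness constraints?
RQ3: What approximation guarantees does a scalable, fairlet-based fair clustering pipeline provide?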

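Key Findings

The (r,b)-fair k-median clustering produced by the proposed method has cost within a factor O_r,b(d*log n + alpha) of the optimal fair cost.
The fairlet decomposition phase runs in nearly linear time, dominated by the embedding and the linear-time processing of the HST.
Empirically, clustering quality is comparable to the prior method (Chierichetti et al., 2017), with nearly linear scaling on large datasets.
The algorithm scales to large datasets and offers finer-grained control over cluster balance than the original fairlet method.
In the experiments, the method shows a significant speedup in fairlet computation while keeping objective values competitive.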
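The objective values discussed in these findings come from the second phase of the pipeline, where fairlets are merged into k clusters. The sketch below shows one way that phase could look, assuming a fairlet decomposition is already available as lists of point indices. The brute-force solver is only a stand-in for the beta-approximate k-median subroutine, weights replace explicit copies of each fairlet center, and all function names are hypothetical.

```python
import math
from itertools import combinations


def fairlet_center(points, fairlet):
    """Member of the fairlet with the smallest total distance to the others."""
    return min(fairlet, key=lambda c: sum(math.dist(points[c], points[i]) for i in fairlet))


def brute_force_kmedian(points, candidates, weights, k):
    """Stand-in k-median solver: try every k-subset of the candidate centers."""
    def cost(chosen):
        return sum(
            w * min(math.dist(points[c], points[s]) for s in chosen)
            for c, w in zip(candidates, weights)
        )
    return min(combinations(candidates, k), key=cost)


def merge_fairlets(points, fairlets, k):
    """Cluster fairlet centers, then move each fairlet as a unit."""
    centers = [fairlet_center(points, f) for f in fairlets]
    weights = [len(f) for f in fairlets]  # weight = number of copies of the center
    chosen = brute_force_kmedian(points, centers, weights, k)
    assignment = [None] * len(points)
    for fairlet, c in zip(fairlets, centers):
        nearest = min(range(k), key=lambda j: math.dist(points[c], points[chosen[j]]))
        for i in fairlet:
            assignment[i] = nearest  # the whole fairlet joins one cluster
    return [points[s] for s in chosen], assignment
```

Because every fairlet is assigned to a cluster as a unit, each final cluster is a union of balanced fairlets, which is why the fairness constraint survives the merge. In practice the brute-force search would be replaced by any scalable k-median algorithm.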


This review was generated by AI and checked by human editors.