QUICK REVIEW

[Paper Review] Fully Scalable MPC Algorithms for Clustering in High Dimension

Artur Czumaj, Guichen Gao|arXiv (Cornell University)|Jul 15, 2023

Data Management and Algorithms | Cited by 1

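One-Sentence Summary

This paper presents the first fully scalable Massively Parallel Computation (MPC) algorithms that achieve O(1)-approximation for clustering in high-dimensional Euclidean space: an O(1)-approximation for uniform facility location and an O(1)-bicriteria approximation for k-Median and k-Means, each in O(1) rounds. The approach rests on a new geometric aggregation primitive that uses consistent hashing to compute neighborhood statistics (such as range counting and nearest-neighbor search) efficiently in high dimension, which allows the local memory per machine to be as small as n^σ for an arbitrarily small constant σ > 0.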

ABSTRACT

We design new parallel algorithms for clustering in high-dimensional Euclidean spaces. These algorithms run in the Massively Parallel Computation (MPC) model, and are fully scalable, meaning that the local memory in each machine may be $n^{\sigma}$ for arbitrarily small fixed $\sigma>0$. Importantly, the local memory may be substantially smaller than the number of clusters $k$, yet all our algorithms are fast, i.e., run in $O(1)$ rounds. We first devise a fast MPC algorithm for $O(1)$-approximation of uniform facility location. This is the first fully-scalable MPC algorithm that achieves $O(1)$-approximation for any clustering problem in general geometric setting; previous algorithms only provide $\mathrm{poly}(\log n)$-approximation or apply to restricted inputs, like low dimension or small number of clusters $k$; e.g. [Bhaskara and Wijewardena, ICML'18; Cohen-Addad et al., NeurIPS'21; Cohen-Addad et al., ICML'22]. We then build on this facility location result and devise a fast MPC algorithm that achieves $O(1)$-bicriteria approximation for $k$-Median and for $k$-Means, namely, it computes $(1+\varepsilon)k$ clusters of cost within $O(1/\varepsilon^2)$-factor of the optimum for $k$ clusters. A primary technical tool that we introduce, and may be of independent interest, is a new MPC primitive for geometric aggregation, namely, computing for every data point a statistic of its approximate neighborhood, for statistics like range counting and nearest-neighbor search. Our implementation of this primitive works in high dimension, and is based on consistent hashing (aka sparse partition), a technique that was recently used for streaming algorithms [Czumaj et al., FOCS'22].
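For reference, the objectives named in the abstract can be written out as follows. This uses the standard textbook definitions; the symbols $P$ (input points), $F$ and $C$ (facility and center sets), $\mathfrak{f}$ (uniform opening cost) and $\mathrm{cost}_z$ are notational assumptions of this sketch, and the review does not state whether facilities are restricted to input points.

```latex
% Uniform facility location: pay \mathfrak{f} per open facility plus connection cost.
\min_{F}\; \mathfrak{f}\cdot |F| \;+\; \sum_{p \in P} \operatorname{dist}(p, F)

% (k, z)-clustering: z = 1 is k-Median, z = 2 is k-Means.
\operatorname{cost}_z(P, C) \;=\; \sum_{p \in P} \operatorname{dist}(p, C)^z ,
\qquad
\mathrm{OPT}^{z}_{\mathrm{cl}} \;=\; \min_{|C| \le k} \operatorname{cost}_z(P, C)

% Bicriteria guarantee stated in the abstract: slightly more centers, constant-factor cost.
|C| \;\le\; (1+\varepsilon)\,k
\qquad\text{and}\qquad
\operatorname{cost}_z(P, C) \;\le\; O\!\left(1/\varepsilon^{2}\right)\cdot \mathrm{OPT}^{z}_{\mathrm{cl}}
```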

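Research Motivation and Objectives

Design fully scalable MPC algorithms for clustering in high-dimensional Euclidean space, where the local memory per machine is n^σ for an arbitrarily small constant σ > 0.
Achieve, in O(1) rounds in the MPC model, an O(1)-approximation for uniform facility location and an O(1)-bicriteria approximation for k-Median and k-Means.
Overcome the limitations of prior work, which either provides only a poly(log n)-approximation or applies only to restricted inputs such as low dimension or a small number of clusters k.
Introduce a new geometric aggregation primitive that computes approximate neighborhood statistics in high dimension and thereby supports efficient MPC computation.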
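To make "a statistic of each point's approximate neighborhood" concrete, here is a minimal Python sketch of a hash-then-aggregate range count. It is not the paper's construction: it uses a plain shifted uniform grid as a simplified stand-in for consistent hashing, and the names `grid_key` and `bucket_counts` are hypothetical. Points near a cell boundary can be separated from close neighbors, and in d dimensions a small ball can touch up to 2^d grid cells, which is roughly the kind of obstacle a high-dimensional hashing scheme has to avoid.

```python
import math
import random
from collections import Counter

def grid_key(point, cell_width, shift):
    """Hash a point to the id of the (shifted) grid cell that contains it."""
    return tuple(math.floor((x + s) / cell_width) for x, s in zip(point, shift))

def bucket_counts(points, cell_width, shift=None, seed=0):
    """For every point, return the number of input points in its grid cell.

    Stand-in for approximate range counting: every counted point lies within
    the cell diameter, but nearby points in adjacent cells are missed.
    """
    dim = len(points[0])
    if shift is None:
        rng = random.Random(seed)
        shift = [rng.uniform(0.0, cell_width) for _ in range(dim)]
    keys = [grid_key(p, cell_width, shift) for p in points]
    # "Reduce by key": in a distributed setting, points sharing a key would be
    # routed to the same machine and counted there.
    per_bucket = Counter(keys)
    return [per_bucket[key] for key in keys]

if __name__ == "__main__":
    pts = [(0.1, 0.1), (0.2, 0.2), (100.0, 100.0)]
    print(bucket_counts(pts, cell_width=1.0, shift=[0.0, 0.0]))
```

The same hash-then-aggregate pattern extends to other statistics (for example, keeping the minimum distance per bucket instead of a count).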

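Proposed Method

Introduce a new MPC primitive that uses consistent hashing (also known as sparse partition) to compute approximate neighborhood statistics such as range counting and nearest-neighbor search.
Apply perturbed weights and a two-part selection rule: (C1) select a point at random with probability μ/γ; (C2) select the point of highest weight in its local neighborhood.
Set weights and radii on a power-of-2 scale, and use Theorem 3.1 to carry out the geometric aggregation efficiently in parallel.
Guess the optimal cost OPT_cl^z over powers of 2 and return the cheapest solution that uses at most (1 + 3μ)k centers.
Combine a weak coreset construction with an MPC-compatible implementation, using Theorem 3.1 to verify the center-selection conditions in O(1) rounds.
Boost the success probability with O(log n) parallel runs to obtain a high-probability guarantee.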
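The guess-and-select step listed above (try powers of 2 as guesses for the optimal cost, keep the cheapest solution with at most (1 + 3μ)k centers) can be sketched as follows. This is only an illustration under stated assumptions: `solve_for_guess` is a hypothetical callable standing in for the paper's per-guess algorithm, `toy_solver` is invented purely to exercise the loop, and the guesses are tried sequentially here although independent guesses could presumably be run side by side.

```python
def cheapest_feasible(solve_for_guess, lower, upper, k, mu):
    """Try guesses lower, 2*lower, 4*lower, ... up to the first guess >= upper.

    solve_for_guess(guess) must return (centers, cost). Returns the cheapest
    (centers, cost) pair with at most (1 + 3*mu)*k centers, or None.
    """
    if lower <= 0 or upper < lower:
        raise ValueError("need 0 < lower <= upper")
    max_centers = (1 + 3 * mu) * k
    best = None
    guess = lower
    while True:
        centers, cost = solve_for_guess(guess)
        feasible = len(centers) <= max_centers
        if feasible and (best is None or cost < best[1]):
            best = (centers, cost)
        if guess >= upper:
            break
        guess *= 2
    return best

def toy_solver(guess):
    """Invented stand-in: larger guesses open fewer centers but cost more."""
    centers = list(range(max(1, 16 // guess)))
    return centers, float(guess)

if __name__ == "__main__":
    print(cheapest_feasible(toy_solver, lower=1, upper=16, k=2, mu=0.5))
```

Doubling keeps the number of guesses logarithmic in the ratio `upper / lower`, at the price of overshooting the best guess by at most a factor of 2.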

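Results

Research Questions

RQ1: Can an O(1)-approximation for facility location be achieved in the fully scalable MPC model with O(1) rounds and sublinear local memory?
RQ2: Can the facility location algorithm be extended to an O(1)-bicriteria approximation for k-Median and k-Means in high dimension?
RQ3: Can a geometric aggregation primitive be designed that supports efficient neighborhood queries in high dimension under sublinear local memory?
RQ4: Can consistent hashing be adapted effectively to the MPC setting to give provable approximation guarantees for high-dimensional clustering?
RQ5: How small can the local memory (n^σ) be while still supporting O(1)-round, O(1)-approximation clustering algorithms in high dimension?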
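A quick calculation shows why the memory regime in these questions is demanding. The sketch below computes the local memory n^σ and the minimum number of machines needed just to hold n points; it deliberately ignores the dependence on dimension and any logarithmic or constant factors, and the function name `mpc_budget` is a hypothetical helper.

```python
import math

def mpc_budget(n, sigma):
    """Return (local_memory, machines) for n points and local memory ~ n**sigma.

    Simplified: one point is one memory word; dimension and log factors are ignored.
    """
    if n < 1 or not 0.0 < sigma <= 1.0:
        raise ValueError("need n >= 1 and 0 < sigma <= 1")
    local_memory = max(1, math.floor(n ** sigma))
    machines = math.ceil(n / local_memory)  # enough machines to store the input
    return local_memory, machines

if __name__ == "__main__":
    n, sigma, k = 10**6, 0.25, 1000
    local_memory, machines = mpc_budget(n, sigma)
    print(local_memory, machines)
    # With these numbers a single machine cannot even store all k centers.
    print("k exceeds local memory:", k > local_memory)
```

In this regime no machine can hold a full candidate solution, so the algorithm has to work through distributed aggregation rather than by gathering the centers in one place.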

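Key Findings

The paper gives the first fully scalable MPC algorithm that achieves an O(1)-approximation for uniform facility location in a general geometric setting, using only O(1) rounds and local memory n^σ for an arbitrarily small constant σ > 0.
It achieves an O(1)-bicriteria approximation for k-Median and k-Means: the solution uses (1 + ε)k centers and has cost within an O(1/ε²) factor of the optimal cost for k centers.
The proposed geometric aggregation primitive, built on consistent hashing, efficiently computes neighborhood statistics (such as range counting and nearest-neighbor search) in high dimension.
After O(log n) parallel runs, the clustering cost is bounded by O(2^z · β^z · γ^3 · OPT_cl^z / μ²) with high probability.
Running the core procedure O(log n) times in parallel boosts the success probability to 1 - 1/poly(n).
The MPC implementation needs only O(1) rounds, with total space O(n) and local memory n^σ, giving the first fully scalable O(1)-round, O(1)-approximation solution for high-dimensional clustering.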
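The boosting step in the findings above follows a standard calculation: if one run succeeds with constant probability p, then t independent runs all fail with probability (1 - p)^t, which drops below n^(-c) once t is about c · ln n / ln(1/(1 - p)). The sketch below evaluates this; the single-run probability `p_single` and the exponent `c` are hypothetical parameters, since the review does not state the paper's constants.

```python
import math

def runs_for_failure_bound(n, p_single, c=1.0):
    """Number of independent runs t such that (1 - p_single)**t <= n**(-c).

    Computed in floating point, so treat the result as sufficient rather than
    guaranteed minimal when the exact ratio is an integer.
    """
    if n < 2 or not 0.0 < p_single < 1.0 or c <= 0:
        raise ValueError("need n >= 2, 0 < p_single < 1 and c > 0")
    return math.ceil(c * math.log(n) / -math.log(1.0 - p_single))

if __name__ == "__main__":
    n, p, c = 1000, 0.5, 2.0
    t = runs_for_failure_bound(n, p, c)
    print(t, (1.0 - p) ** t, n ** (-c))
```

Because t grows only logarithmically in n, the runs can be executed side by side and the best outcome kept, which is consistent with the O(log n) parallel runs reported above.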

This review was generated by AI and reviewed by human editors.