QUICK REVIEW

[Paper Review] On Sampling Based Algorithms for k-Means

Dishant Goyal | arXiv (Cornell University) | Sep 16, 2019

Complexity and Algorithms in Graphs | References: 29 | Cited by: 2

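One-Sentence Summary

This paper presents an algorithm for the list-k-means problem based on a single round of D²-sampling that substantially improves on prior work, yielding streaming algorithms, a faster PTAS under stability, and efficient parallel computation. For any t ≤ k clusters, the method generates a list of (k/ε)^O(t/ε) t-center sets such that, with high probability, at least one is a (1+ε)-approximation, which leads to a 4-pass, log-space streaming PTAS for constrained k-means as well as faster parallel and stable-clustering algorithms.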

ABSTRACT

We generalise the results of Bhattacharya et al. [Bhattacharya et al., 2018] for the list-k-means problem, defined as follows: for an (unknown) partition X₁, ..., X_k of the dataset X ⊆ ℝ^d, find a list of k-center-sets (each element in the list is a set of k centers) such that at least one of the k-center-sets {c₁, ..., c_k} in the list gives a (1+ε)-approximation with respect to the cost function min_{permutation π} [∑_{i=1}^{k} ∑_{x ∈ X_i} ||x - c_{π(i)}||²]. The list-k-means problem is important for the constrained k-means problem since algorithms for the former can be converted to PTAS for various versions of the latter. The algorithm for the list-k-means problem by Bhattacharya et al. is a D²-sampling based algorithm that runs in k iterations. Making use of a constant factor solution for the (classical or unconstrained) k-means problem, we generalise the algorithm of Bhattacharya et al. in two ways: (i) for any fixed set X_{j₁}, ..., X_{j_t} of t ≤ k clusters, the algorithm produces a list of (k/ε)^{O(t/ε)} t-center sets such that (w.h.p.) at least one of them is good for X_{j₁}, ..., X_{j_t}, and (ii) the algorithm runs in a single iteration. Following are the consequences of our generalisations:
1) Faster PTAS under stability and a parameterised reduction: Property (i) of our generalisation is useful in scenarios where finding good centers becomes easier once good centers for a few "bad" clusters have been chosen. One such case is clustering under stability of Awasthi et al. [Awasthi et al., 2010] where the number of such bad clusters is a constant. Using property (i), we significantly improve the running time of their algorithm from O(dn³ (k log n)^{poly(1/β, 1/ε)}) to O(dn³ (k/ε)^{O(1/βε²)}). Another application is a parameterised reduction from the outlier version of k-means to the classical one where the bad clusters are the outliers.
2) Streaming algorithms: The sampling algorithm running in a single iteration (i.e., property (ii)) allows us to design a constant-pass, logspace streaming algorithm for the list-k-means problem. This can be converted to a constant-pass, logspace streaming PTAS for various constrained versions of the k-means problem. In particular, this gives a 3-pass, polylog-space streaming PTAS for the constrained binary k-means problem which in turn gives a 4-pass, polylog-space streaming PTAS for the generalised binary ℓ₀-rank-r approximation problem. This is the first constant-pass, polylog-space streaming algorithm for either of the two problems. Coreset based techniques, which are another approach for designing streaming algorithms in general, are not known to work for the constrained binary k-means problem to the best of our knowledge.
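
To make the cost function in the abstract concrete, the following minimal Python sketch evaluates min over permutations π of ∑_i ∑_{x ∈ X_i} ||x - c_{π(i)}||² by brute force. The function name `list_kmeans_cost` and the toy data are illustrative assumptions, not taken from the paper, and enumerating all k! permutations is only practical for very small k.

```python
from itertools import permutations

def sq_dist(x, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def list_kmeans_cost(clusters, centers):
    """Cost of a k-center set against a fixed partition X_1, ..., X_k.

    Tries every assignment of centers to clusters (k! permutations)
    and returns the cheapest total squared distance.
    """
    k = len(clusters)
    assert len(centers) == k
    best = float("inf")
    for perm in permutations(range(k)):
        cost = sum(
            sq_dist(x, centers[perm[i]])
            for i in range(k)
            for x in clusters[i]
        )
        best = min(best, cost)
    return best

# Hypothetical toy instance: two clusters on the real line.
clusters = [[(0.0,), (2.0,)], [(10.0,), (12.0,)]]
centers = [(11.0,), (1.0,)]  # listed in the "wrong" order on purpose
print(list_kmeans_cost(clusters, centers))  # → 4.0
```

The minimum over permutations is what lets a center set be judged without knowing which center was meant for which cluster: here the centers are listed in swapped order, yet the cost is still the one obtained by the best matching.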

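Motivation and Objectives

Develop a more efficient sampling-based algorithm for the list-k-means problem that avoids the k sequential sampling iterations of earlier work.
Obtain a 4-pass, log-space streaming PTAS for the constrained k-means problem.
Speed up the PTAS under stability conditions (e.g., β-distributed instances).
Support fast parallel computation by removing the sequential k-iteration bottleneck.
Extend the D²-sampling framework to several computational settings: streaming, parallel computation, and stable clustering.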

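Proposed Method

Propose a single-round D²-sampling algorithm that, for any fixed t ≤ k clusters, generates a list of (k/ε)^O(t/ε) t-center sets.
Use one sampling template across all settings, adapting the analysis to the context rather than modifying the algorithm itself.
Take a constant-factor approximate solution as input and obtain a (1+ε)-approximation through sampling-based list generation.
Apply the list-generation framework to design a 2-pass streaming algorithm for list-k-means, which yields a 4-pass streaming PTAS for constrained k-means.
Adapt the method to clustering under stability (e.g., β-distributed instances), reducing the running time from O(dn³(k log n)^poly(1/β,1/ε)) to O(dn³(k/ε)^O(1/βε²)).
Obtain a fast parallel PTAS in the CREW model by replacing the k sequential iterations with a single parallelizable sampling phase.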
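
The general shape of such a method can be sketched in Python: draw one batch of D²-samples with respect to a given constant-factor solution, then enumerate centroids of small subsets of the batch as candidate centers. This is a rough illustration only. The pool size `m`, the subset size `subset_size`, the step that adds copies of the approximate centers, and all function names are assumptions made for this sketch; the summary does not state the paper's exact parameters or procedure.

```python
import random
from itertools import combinations

def sq_dist(x, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def d2_sample(points, centers, m, rng):
    """Draw m points with replacement, each with probability proportional
    to its squared distance to the nearest center (D²-sampling)."""
    weights = [min(sq_dist(x, c) for c in centers) for x in points]
    if sum(weights) == 0:  # every point coincides with some center
        return [rng.choice(points) for _ in range(m)]
    return rng.choices(points, weights=weights, k=m)

def centroid(subset):
    """Coordinate-wise mean of a non-empty sequence of points."""
    d = len(subset[0])
    return tuple(sum(p[j] for p in subset) / len(subset) for j in range(d))

def candidate_centers(points, approx_centers, m, subset_size, rng):
    """One sampling round, then centroids of all small subsets of the pool."""
    pool = d2_sample(points, approx_centers, m, rng)
    # Assumption of this sketch: also add copies of the approximate centers,
    # so clusters already close to them are represented in the pool.
    pool += [c for c in approx_centers for _ in range(subset_size)]
    return {centroid(s) for s in combinations(pool, subset_size)}

def list_of_center_sets(candidates, t):
    """All t-element sets of candidate centers (the output list)."""
    return list(combinations(sorted(candidates), t))

# Hypothetical toy run: two well-separated groups, one approximate center.
rng = random.Random(0)
points = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0), (10.0, 9.0)]
approx_centers = [(0.0, 0.0)]
cands = candidate_centers(points, approx_centers, m=4, subset_size=2, rng=rng)
center_sets = list_of_center_sets(cands, t=2)
print(len(cands), len(center_sets))
```

Because sampling happens once and everything afterwards is enumeration over the sampled pool, no step depends on the outcome of an earlier sampling round. Choosing t subsets of size O(1/ε) from a pool of poly(k/ε) points would also be consistent with a list of size (k/ε)^O(t/ε), although that reading is an inference from the stated bound rather than something the summary spells out.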

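Results

Research Questions

RQ1: Can the multi-round D²-sampling method be replaced by a single-round sampling algorithm while preserving the approximation guarantee?
RQ2: Can the list-k-means framework be extended to support streaming, log-space computation for constrained k-means?
RQ3: Does the single-round method yield a faster PTAS under stability assumptions such as β-distributed instances?
RQ4: Can the algorithm be made highly parallel by removing the sequential dependencies of the iterative structure?
RQ5: Can the framework be generalized to constrained clustering variants beyond those studied in the paper?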

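Key Findings

The proposed algorithm runs in a single iteration and generates a list of (k/ε)^O(t/ε) t-center sets such that, for any fixed t ≤ k clusters, at least one is a (1+ε)-approximation with high probability.
It gives a 2-pass, log-space streaming algorithm for list-k-means, which supports a 4-pass, log-space streaming PTAS for various constrained k-means problems.
For β-distributed k-means instances, the running time drops from O(dn³(k log n)^poly(1/β,1/ε)) to O(dn³(k/ε)^O(1/βε²)).
In the CREW model, the algorithm gives a fast parallel PTAS with running time O(poly(nε,k,d,1/ε) · n^{1−ε}/N) on N processors.
The framework uniformly extends D²-sampling to streaming, parallel, and stable-clustering settings, decoupling algorithmic simplicity from analytical complexity.
The approach shows that a single sampling template, combined with context-specific analysis, can support diverse computational models and problem variants.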
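
One reason a single sampling round suits the streaming setting is that D²-sampling with respect to a fixed center set can be done in one pass with memory proportional to the number of samples. The Python sketch below uses standard weighted reservoir sampling for this. It assumes the constant-factor centers were already computed in an earlier pass; the function name `stream_d2_sample` and the pass structure are illustrative assumptions and do not reproduce the paper's streaming algorithm.

```python
import random

def sq_dist(x, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def stream_d2_sample(stream, centers, m, rng):
    """One pass over the stream, O(m) memory: m independent D²-samples.

    Each slot is a size-1 weighted reservoir. A point with weight w
    replaces the slot's content with probability w / (total weight so far),
    which leaves every point selected with probability w / (final total).
    """
    total = 0.0
    samples = [None] * m
    for x in stream:
        w = min(sq_dist(x, c) for c in centers)
        if w == 0:
            continue  # zero-weight points are never selected
        total += w
        for j in range(m):
            if rng.random() < w / total:
                samples[j] = x
    return samples

# Hypothetical toy stream; a generator is used so it can be read only once.
rng = random.Random(0)
stream = ((float(i), 0.0) for i in range(100))
centers = [(0.0, 0.0)]
print(stream_d2_sample(stream, centers, m=3, rng=rng))
```

The slots are updated independently, so the m samples are drawn with replacement, which matches the usual definition of repeated D²-sampling. If every point in the stream coincides with a center, all slots remain `None`, and a caller would need to handle that case.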


This review was generated by AI and reviewed by human editors.