QUICK REVIEW

[Paper Review] On Sampling Based Algorithms for k-Means

Dishant Goyal | arXiv (Cornell University) | Sep 16, 2019

Complexity and Algorithms in Graphs | References: 29 | Citations: 2

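One-Line Summary

This paper presents a one-iteration D²-sampling-based algorithm for the list-k-means problem. Compared with prior work, it substantially improves efficiency in three settings: streaming, PTAS under stability, and parallel computation. For any fixed set of t ≤ k clusters, the method produces a list of (k/ε)^{O(t/ε)} t-center sets and guarantees that, with high probability, at least one of them is a (1+ε)-approximation. This enables a 4-pass logspace streaming PTAS for constrained k-means problems, as well as fast parallel and stable-clustering algorithms.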

ABSTRACT

We generalise the results of Bhattacharya et al. [Bhattacharya et al., 2018] for the list-k-means problem defined as: for an (unknown) partition X₁, ..., X_k of the dataset X ⊆ ℝ^d, find a list of k-center-sets (each element in the list is a set of k centers) such that at least one of the k-center-sets {c₁, ..., c_k} in the list gives a (1+ε)-approximation with respect to the cost function min_{permutation π} [∑_{i=1}^{k} ∑_{x ∈ X_i} ||x - c_{π(i)}||²]. The list-k-means problem is important for the constrained k-means problem since algorithms for the former can be converted to PTAS for various versions of the latter. The algorithm for the list-k-means problem by Bhattacharya et al. is a D²-sampling based algorithm that runs in k iterations. Making use of a constant factor solution for the (classical or unconstrained) k-means problem, we generalise the algorithm of Bhattacharya et al. in two ways: (i) for any fixed set X_{j₁}, ..., X_{j_t} of t ≤ k clusters, the algorithm produces a list of (k/ε)^{O(t/ε)} t-center sets such that (w.h.p.) at least one of them is good for X_{j₁}, ..., X_{j_t}, and (ii) the algorithm runs in a single iteration. Following are the consequences of our generalisations:
1) Faster PTAS under stability and a parameterised reduction: Property (i) of our generalisation is useful in scenarios where finding good centers becomes easier once good centers for a few "bad" clusters have been chosen. One such case is clustering under stability of Awasthi et al. [Awasthi et al., 2010] where the number of such bad clusters is a constant. Using property (i), we significantly improve the running time of their algorithm from O(dn³) (k log n)^{poly(1/β, 1/ε)} to O(dn³ (k/ε)^{O(1/(βε²))}). Another application is a parameterised reduction from the outlier version of k-means to the classical one where the bad clusters are the outliers.
2) Streaming algorithms: The sampling algorithm running in a single iteration (i.e., property (ii)) allows us to design a constant-pass, logspace streaming algorithm for the list-k-means problem. This can be converted to a constant-pass, logspace streaming PTAS for various constrained versions of the k-means problem. In particular, this gives a 3-pass, polylog-space streaming PTAS for the constrained binary k-means problem which in turn gives a 4-pass, polylog-space streaming PTAS for the generalised binary ℓ₀-rank-r approximation problem. This is the first constant pass, polylog-space streaming algorithm for either of the two problems. Coreset based techniques, which is another approach for designing streaming algorithms in general, is not known to work for the constrained binary k-means problem to the best of our knowledge.
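The cost function in the abstract can be made concrete with a small sketch. The code below evaluates min over permutations π of ∑_i ∑_{x ∈ X_i} ||x - c_{π(i)}||² by brute force and picks the cheapest candidate from a list. The function names (`sq_dist`, `list_kmeans_cost`, `best_in_list`) are hypothetical and chosen for illustration; the brute-force loop over all k! permutations is only practical for very small k and is not claimed to be how the paper evaluates the cost.

```python
from itertools import permutations


def sq_dist(x, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def list_kmeans_cost(clusters, centers):
    """Cost of a k-center-set against a fixed partition X_1, ..., X_k.

    clusters: list of k clusters, each a list of points.
    centers:  list of k centers.
    Returns the minimum, over all assignments pi of centers to clusters,
    of the summed squared distances.
    """
    k = len(clusters)
    best = float("inf")
    for perm in permutations(range(k)):
        total = sum(
            sq_dist(x, centers[perm[i]])
            for i in range(k)
            for x in clusters[i]
        )
        best = min(best, total)
    return best


def best_in_list(clusters, center_list):
    """Return the candidate k-center-set with the lowest cost."""
    return min(center_list, key=lambda cs: list_kmeans_cost(clusters, cs))


clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (12.0, 0.0)]]
print(list_kmeans_cost(clusters, [(11.0, 0.0), (1.0, 0.0)]))  # 4.0
```

In the example the centers are listed in the "wrong" order, and the minimum over permutations matches each center to the cluster it actually serves.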

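Research Motivation and Objectives

Develop a more efficient sampling-based algorithm for the list-k-means problem that avoids iterative refinement.
Obtain a 4-pass logspace streaming PTAS for constrained k-means problems.
Speed up the PTAS under stability assumptions such as β-distributed instances.
Support fast parallel computation by removing the bottleneck of k sequential iterations.
Generalise the D²-sampling framework to diverse computational models, including streaming, parallel, and stable-clustering settings.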

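Proposed Method

Proposes a one-iteration D²-sampling algorithm that, for any fixed set of t ≤ k clusters, produces a list of (k/ε)^{O(t/ε)} t-center sets.
Uses one uniform sampling template across all settings, adapting the analysis to the context rather than changing the algorithm.
Takes a constant-factor approximate solution as input and bootstraps it into a (1+ε)-approximation through sampling-based list generation.
Applies the list-generation framework to design a 2-pass streaming algorithm for list-k-means, which yields a 4-pass streaming PTAS for constrained k-means problems.
Adapts the method to clustering under stability (β-distributed instances), reducing the running time from O(dn³) (k log n)^{poly(1/β, 1/ε)} to O(dn³ (k/ε)^{O(1/(βε²))}).
Obtains a fast parallel PTAS in the CREW model by replacing the k sequential iterations with a single sampling phase that can run in parallel.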
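The single-iteration idea above can be sketched roughly as follows: draw one D² sample with respect to a given constant-factor solution, then form candidate centers as centroids of small subsets of that sample. This is a simplified illustration, not the paper's algorithm as stated. The parameter names `m` (sample size) and `subset_size`, and every function name, are assumptions made for this sketch, and unlike the paper's construction it does not use the sample and subset sizes required by the analysis or enforce any structure on which subsets are combined.

```python
import random
from itertools import combinations


def sq_dist(x, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def d2_sample(points, centers, m, rng):
    """Draw m points i.i.d., each with probability proportional to its
    squared distance to the nearest center (D^2-sampling w.r.t. centers)."""
    weights = [min(sq_dist(x, c) for c in centers) for x in points]
    if sum(weights) == 0:
        # Every point coincides with a center; fall back to uniform sampling.
        return [rng.choice(points) for _ in range(m)]
    return rng.choices(points, weights=weights, k=m)


def centroid(pts):
    """Coordinate-wise mean of a non-empty collection of points."""
    dim = len(pts[0])
    return tuple(sum(p[j] for p in pts) / len(pts) for j in range(dim))


def list_t_center_sets(points, centers, t, m, subset_size, rng):
    """One sampling phase, then enumeration: no further passes over points.

    Candidate centers are centroids of subsets of (sample + given centers);
    the returned list contains every t-subset of those candidates.
    """
    sample = d2_sample(points, centers, m, rng) + list(centers)
    candidates = [centroid(s) for s in combinations(sample, subset_size)]
    return list(combinations(candidates, t))


rng = random.Random(0)
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
rough_centers = [(0.5, 0.0)]  # stand-in for a constant-factor solution
center_sets = list_t_center_sets(points, rough_centers, t=2, m=3,
                                 subset_size=2, rng=rng)
print(len(center_sets))  # 15
```

The point of the sketch is the control flow: all sampling depends only on the input solution, so nothing has to wait for an earlier round of chosen centers. The list grows quickly with `m`, `subset_size`, and `t`, which is consistent with the exponential list size quoted above.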

Results

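Research Questions

RQ1: Can a one-iteration sampling algorithm replace multi-iteration D²-sampling for the list-k-means problem while preserving the approximation guarantee?
RQ2: Can the list-k-means framework support streaming and logspace computation for constrained k-means problems?
RQ3: Does the one-iteration approach yield a faster PTAS under stability assumptions such as β-distributed instances?
RQ4: Can the algorithm be made highly parallel by eliminating the sequential dependencies in the iterative structure?
RQ5: Does the framework generalise to constrained clustering variants beyond those treated in the paper?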

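Key Findings

The proposed algorithm runs in a single iteration and, for any fixed set of t ≤ k clusters, produces a list of (k/ε)^{O(t/ε)} t-center sets. With high probability, at least one of them is a (1+ε)-approximation.
A 2-pass logspace streaming algorithm is obtained for the list-k-means problem, which in turn yields a 4-pass logspace streaming PTAS for various constrained k-means problems.
For β-distributed k-means instances, the running time drops from O(dn³) (k log n)^{poly(1/β, 1/ε)} to O(dn³ (k/ε)^{O(1/(βε²))}), removing the dependence on log n from the exponentiated term.
In the CREW model, a fast parallel PTAS runs in O(poly(nε,k,d,1/ε) · n^{1−ε}/N) time using N processors.
The framework generalises the D²-sampling approach consistently across streaming, parallel, and stable-clustering settings, separating the simplicity of the algorithm from the complexity of its analysis.
Applying context-specific analysis to a single sampling template is shown to support diverse computational models and problem variants.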
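One way to see why a single sampling iteration suits the streaming setting: once a rough solution is fixed, each point's D² weight can be computed the moment the point arrives, so a weighted reservoir can draw the sample in one pass. The sketch below is an illustration under that assumption and is not taken from the paper. `stream_d2_sample` is a hypothetical name, and the sketch leaves out how the rough solution itself is obtained and the space accounting behind the logspace claim.

```python
import random


def sq_dist(x, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def stream_d2_sample(stream, centers, m, rng):
    """Draw m independent D^2-samples in one pass over an iterable.

    Keeps m single-item weighted reservoirs. A point with weight w replaces
    the item in a reservoir with probability w / (total weight seen so far),
    which leaves each reservoir holding a point chosen with probability
    proportional to its weight. Memory use is m points plus the centers.
    """
    total = 0.0
    reservoirs = [None] * m
    for x in stream:
        w = min(sq_dist(x, c) for c in centers)
        if w == 0:
            continue  # zero-weight points are never selected
        total += w
        for j in range(m):
            if rng.random() < w / total:
                reservoirs[j] = x
    return reservoirs


rng = random.Random(0)
# Only (3.0, 4.0) has non-zero distance to the center, so it fills every slot.
stream = iter([(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)])
print(stream_d2_sample(stream, [(0.0, 0.0)], 3, rng))
# [(3.0, 4.0), (3.0, 4.0), (3.0, 4.0)]
```

A k-iteration scheme, by contrast, would need the centers chosen in earlier rounds before it could compute the weights for the next round, which costs at least one additional pass per round.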


This review was generated by AI and checked by a human editor.