QUICK REVIEW

[Paper Review] Communication-Avoiding Linear Algebraic Kernel K-Means on GPUs

Julian Bellavita, Matthew Rubino|arXiv (Cornell University)|Jan 23, 2026

Advanced Clustering Algorithms Research | Citations: 0

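One-Line Summary

Summary: This paper proposes a distributed-memory, multi-GPU implementation of exact Kernel K-means, built on communication-avoiding linear algebra primitives, that can process large (million-scale) datasets. It highlights in particular a novel 1.5D algorithm that combines GEMM and SpMM so as to minimize communication.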

ABSTRACT

Clustering is an important tool in data analysis, with K-means being popular for its simplicity and versatility. However, it cannot handle non-linearly separable clusters. Kernel K-means addresses this limitation but requires a large kernel matrix, making it computationally and memory intensive. Prior work has accelerated Kernel K-means by formulating it using sparse linear algebra primitives and implementing it on a single GPU. However, that approach cannot run on datasets with more than approximately 80,000 samples due to limited GPU memory. In this work, we address this issue by presenting a suite of distributed-memory parallel algorithms for large-scale Kernel K-means clustering on multi-GPU systems. Our approach maps the most computationally expensive components of Kernel K-means onto communication-efficient distributed linear algebra primitives uniquely tailored for Kernel K-means, enabling highly scalable implementations that efficiently cluster million-scale datasets. Central to our work is the design of partitioning schemes that enable communication-efficient composition of the linear algebra primitives that appear in Kernel K-means. Our 1.5D algorithm consistently achieves the highest performance, enabling Kernel K-means to scale to data one to two orders of magnitude larger than previously practical. On 256 GPUs, it achieves a geometric mean weak scaling efficiency of $79.7\%$ and a geometric mean strong scaling speedup of $4.2\times$. Compared to our 1D algorithm, the 1.5D approach achieves up to a $3.6\times$ speedup on 256 GPUs and reduces clustering time from over an hour to under two seconds relative to a single-GPU sliding window implementation. Our results show that distributed algorithms designed with application-specific linear algebraic formulations can achieve substantial performance improvement.
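
A back-of-the-envelope estimate shows why a single GPU runs out of memory near the sample count quoted in the abstract. The sketch below assumes a dense n x n kernel matrix stored in 32-bit floats. The precision, the helper name `kernel_matrix_gb`, and the even split across 256 GPUs are assumptions made for illustration, not details taken from the paper.

```python
def kernel_matrix_gb(n_samples, bytes_per_entry=4):
    """Size in GB (10^9 bytes) of a dense n x n kernel matrix.

    bytes_per_entry=4 assumes FP32 storage (an assumption, not from the paper).
    """
    return n_samples * n_samples * bytes_per_entry / 1e9


for n in (80_000, 1_000_000):
    total = kernel_matrix_gb(n)
    # Hypothetical even split of the matrix over 256 GPUs.
    print(f"n={n:>9,}: {total:8.1f} GB total, {total / 256:6.2f} GB per GPU on 256 GPUs")
```

Under these assumptions, 80,000 samples already need about 25.6 GB for the kernel matrix alone, while one million samples need about 4,000 GB, or roughly 15.6 GB per GPU when spread evenly over 256 GPUs.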

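Motivation and Objectives

Motivate scalable, exact Kernel K-means clustering that overcomes single-GPU memory limits on large datasets.
Develop distributed-memory algorithms that map the components of Kernel K-means onto communication-efficient linear algebra primitives.
Investigate partitioning and composition strategies that minimize inter-process communication while preserving load balance.
Demonstrate practical scalability and provide an open-source software implementation.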

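Proposed Method

Maps Kernel K-means onto distributed GEMM and sparse-dense multiplication (SpMM/SpMV) on distributed-memory systems.
Proposes four algorithms (1D, Hybrid 1D, 1.5D, and 2D), each with domain-specific partitioning to reduce communication.
Introduces a 1.5D algorithm that combines a SUMMA-based GEMM with a 1D distribution of $\mathbf{V}$, enabling efficient SpMM and cluster updates.
Exploits the sparsity structure of $\mathbf{V}$ (one nonzero per column) to achieve load-balanced SpMM with minimal communication.
Provides an open-source GPU implementation (Vivaldi) and compares it against a single-GPU sliding-window baseline.
Presents a communication cost analysis for $\mathbf{K}$ and $\mathbf{D}^T$ under each algorithm, showing that the 1.5D algorithm achieves the best asymptotic efficiency.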
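
The mapping onto GEMM and SpMM can be illustrated on a single process. The sketch below is not the Vivaldi implementation. It assumes one common linear-algebraic formulation in which $\mathbf{V}$ is a k x n matrix with V[c, j] = 1/|C_c| when point j belongs to cluster c, so the squared feature-space distances are diag(K) - 2 (V K) + diag(V K V^T). The function names, the linear kernel, and the toy data are hypothetical choices for illustration.

```python
import numpy as np


def kernel_distances(K, labels, k):
    """Squared feature-space distances (k x n) from every point to every centroid."""
    n = K.shape[0]
    counts = np.bincount(labels, minlength=k).astype(float)
    safe = np.maximum(counts, 1.0)  # guard against division by zero for empty clusters
    # SpMM step: E = V K. V has one nonzero per column, so the product is a
    # scatter-add of the rows of K into their cluster rows, scaled by 1/|C_c|.
    E = np.zeros((k, n))
    np.add.at(E, labels, K)  # E[labels[j], :] += K[j, :]
    E /= safe[:, None]
    # Centroid norms: z[c] = (V K V^T)[c, c] = mean of E[c, i] over i in cluster c.
    z = np.bincount(labels, weights=E[labels, np.arange(n)], minlength=k) / safe
    D = np.diag(K)[None, :] - 2.0 * E + z[:, None]
    D[counts == 0, :] = np.inf  # never assign points to an empty cluster
    return D


def kernel_kmeans(K, labels0, k, max_iter=100):
    """Iterate assignment updates until the labels stop changing."""
    labels = np.asarray(labels0).copy()
    for _ in range(max_iter):
        new_labels = np.argmin(kernel_distances(K, labels, k), axis=0)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels


# Toy data: two well-separated 1D groups. The kernel matrix is a plain
# Gram matrix (linear kernel), which is the GEMM step.
X = np.array([[0.0], [1.0], [2.0], [100.0], [101.0], [102.0]])
K = X @ X.T
print(kernel_kmeans(K, np.array([0, 0, 1, 1, 1, 0]), k=2))
```

With a linear kernel the distances reduce to ordinary squared Euclidean distances to the cluster means, which makes the formulation easy to check against a brute-force computation. A nonlinear kernel would only change how K is built from the Gram matrix.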

Figure 1 : 1.5D SpMM algorithm on $P=4$ processes. $\mathbf{V}$ is partitioned 1D columnwise and $\mathbf{K}$ in 2D. (1) The nonzeros of each $\mathbf{V}$ partition are replicated along the corresponding process row. (2) Each process performs a local SpMM with its $\mathbf{V}$ replicas and local $\m
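
The data movement in the caption can be mimicked in a single process with NumPy. The sketch below is a simulation, not the paper's GPU code. It assumes a q x q process grid with row-major ranks, a dense stand-in for the sparse $\mathbf{V}$, and a final step that sums partial results down each process column. That last step is an assumption, since the caption is cut off before it describes what follows the local SpMM. The function name `simulate_15d_spmm` is hypothetical.

```python
import numpy as np


def simulate_15d_spmm(V, K, q):
    """Simulate E = V K on a q x q process grid (P = q*q) in one process."""
    k, n = V.shape
    P = q * q
    assert n % P == 0, "n must be divisible by the number of processes"
    vb = n // P  # columns of V owned by each process (1D columnwise partition)
    kb = n // q  # side length of each 2D block of K
    V_parts = [V[:, p * vb:(p + 1) * vb] for p in range(P)]

    E = np.zeros((k, n))
    for c in range(q):  # process column c
        acc = np.zeros((k, kb))
        for r in range(q):  # process row r
            # (1) Replicate the V partitions owned by process row r along that row.
            V_row = np.hstack([V_parts[r * q + j] for j in range(q)])
            # (2) Local multiply with the K block held by process (r, c).
            K_blk = K[r * kb:(r + 1) * kb, c * kb:(c + 1) * kb]
            acc += V_row @ K_blk
        # (3) Assumed: partial results are summed down process column c.
        E[:, c * kb:(c + 1) * kb] = acc
    return E


# Small example: n = 8 points, k = 3 clusters, P = 4 processes (q = 2).
labels = np.array([0, 1, 2, 0, 1, 2, 0, 1])
counts = np.bincount(labels, minlength=3)
V = np.zeros((3, 8))
V[labels, np.arange(8)] = 1.0 / counts[labels]  # one nonzero per column
A = np.random.default_rng(0).standard_normal((8, 4))
K = A @ A.T  # symmetric stand-in for a kernel matrix
print(np.allclose(simulate_15d_spmm(V, K, q=2), V @ K))
```

Because the columns of $\mathbf{V}$ owned by a process row line up with the rows of the $\mathbf{K}$ blocks in that row, each process only needs the $\mathbf{V}$ partitions from its own row before multiplying.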

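Experimental Results

Research Questions

RQ1: How can Kernel K-means be scaled to million-scale datasets on multi-GPU clusters while still computing the exact solution?
RQ2: What strategy best partitions the GEMM, SpMM, and SpMV operations of Kernel K-means across distributed memory so that load is balanced and data movement is minimized?
RQ3: Does the 1.5D approach outperform the 1D and 2D variants in both weak and strong scaling?
RQ4: How well do the proposed algorithms perform in practice on real datasets, and how do they compare with single-GPU approaches?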

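Key Findings

The 1.5D algorithm consistently delivers the highest performance, scaling Kernel K-means to datasets one to two orders of magnitude larger than previously practical.
On 256 GPUs, the 1.5D algorithm achieves a geometric-mean weak scaling efficiency of 79.7% and a geometric-mean strong scaling speedup of 4.2×.
Compared with the 1D baseline, the 1.5D algorithm achieves up to a 3.6× speedup in strong scaling on 256 GPUs, and it clusters more than 2000× faster than the single-GPU sliding-window method (the abstract reports a drop from over an hour to under two seconds).
The 1.5D and 2D algorithms can process more than 1.5 million points without running out of memory.
The implementation is open-sourced as the Vivaldi package, and the project is available at the provided GitHub link.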
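
The review does not define the scaling metrics quoted above. The sketch below assumes the usual definitions: weak scaling efficiency is the baseline time divided by the time on P GPUs with the work per GPU held constant, strong scaling speedup is the baseline time divided by the time on P GPUs with the total problem fixed, and per-dataset values are aggregated with a geometric mean. All timings are made-up placeholders, not measurements from the paper.

```python
import math


def weak_scaling_efficiency(t_base, t_p):
    """Baseline time / time on P GPUs, with the problem size grown in proportion to P."""
    return t_base / t_p


def strong_scaling_speedup(t_base, t_p):
    """Baseline time / time on P GPUs, with the total problem size fixed."""
    return t_base / t_p


def geometric_mean(values):
    """Geometric mean, computed in log space for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical (baseline seconds, seconds on P GPUs) for three datasets.
weak_runs = [(2.0, 2.5), (4.0, 5.0), (1.0, 1.6)]
efficiencies = [weak_scaling_efficiency(b, p) for b, p in weak_runs]
print(f"geometric-mean weak scaling efficiency: {geometric_mean(efficiencies):.1%}")
```

A geometric mean is the natural aggregate here because efficiencies and speedups are ratios, so one dataset with an unusually large or small ratio does not dominate the summary the way it would in an arithmetic mean.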


This review was generated by AI and checked by a human editor.