QUICK REVIEW

[Paper Review] Communication-Avoiding Linear Algebraic Kernel K-Means on GPUs

Julian Bellavita, Matthew Rubino|arXiv (Cornell University)|Jan 23, 2026

Advanced Clustering Algorithms Research|Cited by 0

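One-Sentence Summary

This paper presents an exact Kernel K-means implementation for distributed-memory multi-GPU systems, built on communication-avoiding linear algebra primitives and able to cluster million-scale datasets; its centerpiece is a novel 1.5D algorithm that fuses GEMM and SpMM to minimize communication.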

ABSTRACT

Clustering is an important tool in data analysis, with K-means being popular for its simplicity and versatility. However, it cannot handle non-linearly separable clusters. Kernel K-means addresses this limitation but requires a large kernel matrix, making it computationally and memory intensive. Prior work has accelerated Kernel K-means by formulating it using sparse linear algebra primitives and implementing it on a single GPU. However, that approach cannot run on datasets with more than approximately 80,000 samples due to limited GPU memory. In this work, we address this issue by presenting a suite of distributed-memory parallel algorithms for large-scale Kernel K-means clustering on multi-GPU systems. Our approach maps the most computationally expensive components of Kernel K-means onto communication-efficient distributed linear algebra primitives uniquely tailored for Kernel K-means, enabling highly scalable implementations that efficiently cluster million-scale datasets. Central to our work is the design of partitioning schemes that enable communication-efficient composition of the linear algebra primitives that appear in Kernel K-means. Our 1.5D algorithm consistently achieves the highest performance, enabling Kernel K-means to scale to data one to two orders of magnitude larger than previously practical. On 256 GPUs, it achieves a geometric mean weak scaling efficiency of $79.7\%$ and a geometric mean strong scaling speedup of $4.2\times$. Compared to our 1D algorithm, the 1.5D approach achieves up to a $3.6\times$ speedup on 256 GPUs and reduces clustering time from over an hour to under two seconds relative to a single-GPU sliding window implementation. Our results show that distributed algorithms designed with application-specific linear algebraic formulations can achieve substantial performance improvement.

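Motivation and Objectives

Enable scalable, exact Kernel K-means clustering for large datasets that exceed the memory of a single GPU.
Develop distributed-memory algorithms that map the components of Kernel K-means onto communication-efficient linear algebra primitives.
Explore partitioning and composition strategies that minimize inter-process communication while preserving load balance.
Demonstrate practical scalability and provide an open-source software implementation.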
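
To make the memory limit concrete, the short calculation below estimates the size of a dense kernel matrix. It is a back-of-the-envelope sketch, not a figure from the paper: it assumes 4-byte (single-precision) entries, fully dense storage, and a perfectly even split across GPUs with no extra buffers. The function name `kernel_matrix_bytes` is hypothetical.

```python
def kernel_matrix_bytes(n_samples: int, bytes_per_entry: int = 4) -> int:
    """Bytes needed to store a dense n x n kernel matrix.

    Assumption: 4-byte entries (float32); the source does not state the precision.
    """
    return n_samples * n_samples * bytes_per_entry


GB = 10**9

# Roughly the single-GPU limit quoted in the abstract.
single_gpu = kernel_matrix_bytes(80_000)
# Scale reported for the 1.5D and 2D algorithms.
large = kernel_matrix_bytes(1_500_000)
# Hypothetical even split of that matrix over 256 GPUs.
per_gpu = large / 256

print(f"{single_gpu / GB:.1f} GB")  # 25.6 GB
print(f"{large / GB:.0f} GB")       # 9000 GB
print(f"{per_gpu / GB:.1f} GB")     # 35.2 GB
```

Under these assumptions, the kernel matrix for 80,000 samples already occupies tens of gigabytes, while a matrix for 1.5 million samples is far beyond any single device and has to be partitioned.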

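Proposed Method

Map Kernel K-means onto distributed GEMM and sparse-dense multiplication (SpMM/SpMV) on distributed-memory systems.
Propose four algorithms (1D, hybrid 1D, 1.5D, and 2D) that use domain-specific partitioning to reduce communication.
Introduce a 1.5D algorithm that uses SUMMA for the GEMM and a 1D distribution of V for efficient SpMM and cluster updates.
Exploit the sparsity structure of V (one nonzero per column) to achieve load-balanced SpMM with minimal communication.
Provide an open-source GPU implementation (Vivaldi) and compare it against a single-GPU sliding-window baseline.
Analyze the communication cost for K and D^T under each algorithm, showing that the 1.5D algorithm achieves the best asymptotic efficiency.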
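
The linear-algebraic view of Kernel K-means can be illustrated on a single process. The sketch below is not the Vivaldi implementation and involves no distribution or GPU code; it only shows how a GEMM produces the kernel matrix K and how a product with an assignment matrix V (one nonzero per column) produces the quantities needed for the cluster update. The Gaussian kernel, the function name `kernel_kmeans`, the variable names `E`, `z`, and `dist`, and the dense storage of V are all assumptions made for illustration.

```python
import numpy as np


def kernel_kmeans(X, init_labels, k, gamma=0.5, n_iter=20):
    """Single-process sketch of Kernel K-means in matrix form (illustrative only)."""
    n = X.shape[0]

    # GEMM step: Gram matrix X X^T, then an elementwise Gaussian kernel
    # (the kernel choice is an assumption, not taken from the source).
    G = X @ X.T
    sq = np.diag(G)
    K = np.exp(-gamma * (sq[:, None] - 2.0 * G + sq[None, :]))

    labels = np.asarray(init_labels)
    for _ in range(n_iter):
        # V is k x n with exactly one nonzero per column:
        # V[c, j] = 1 / |cluster c| if point j belongs to cluster c.
        counts = np.bincount(labels, minlength=k).astype(float)
        V = np.zeros((k, n))
        V[labels, np.arange(n)] = 1.0 / counts[labels]

        # SpMM step (V is stored densely here for simplicity):
        # E[c, j] is the mean kernel value between point j and cluster c.
        E = V @ K
        # Per-cluster term (V K V^T)_cc.
        z = np.einsum("cj,cj->c", E, V)
        # Squared feature-space distance to each centroid, dropping the
        # constant K[j, j], which does not affect the argmin.
        dist = -2.0 * E + z[:, None]
        dist[counts == 0, :] = np.inf  # never assign points to an empty cluster

        new_labels = dist.argmin(axis=0)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels


# Two small, well-separated groups with a deliberately imperfect start.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
print(kernel_kmeans(X, [0, 0, 0, 1, 0, 1, 1, 1], k=2))
```

Because every column of V holds a single nonzero, each point contributes to exactly one cluster row, which is the property the method list above credits for load-balanced SpMM.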

Figure 1 : 1.5D SpMM algorithm on $P=4$ processes. $\mathbf{V}$ is partitioned 1D columnwise and $\mathbf{K}$ in 2D. (1) The nonzeros of each $\mathbf{V}$ partition are replicated along the corresponding process row. (2) Each process performs a local SpMM with its $\mathbf{V}$ replicas and local $\m
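
The data layout in this caption can be mimicked in a single process to check that the block decomposition reproduces the full product. The following is a simulation sketch only: there is no MPI or GPU code, the mapping of V partitions to process rows follows a row-major rank ordering that is assumed here, and the final summation of partial results down each process column is also an assumption, since the caption is cut off before that step. The function name `spmm_15d_simulated` is hypothetical.

```python
import numpy as np


def spmm_15d_simulated(V, K, grid=2):
    """Simulate E = V @ K on a grid x grid process grid (P = grid**2)."""
    k, n = V.shape
    P = grid * grid
    assert n % P == 0, "sketch assumes n is divisible by P"
    w = n // P     # columns per 1D partition of V
    b = n // grid  # rows and columns per 2D block of K

    # 1D columnwise partition of V: process p owns V[:, p*w:(p+1)*w].
    V_parts = [V[:, p * w:(p + 1) * w] for p in range(P)]

    E = np.zeros((k, n))
    for i in range(grid):  # process row
        # Step (1): the V partitions matching K's row block i are made
        # available to every process in process row i (assumed mapping).
        V_row = np.hstack(V_parts[i * grid:(i + 1) * grid])
        for j in range(grid):  # process column
            K_ij = K[i * b:(i + 1) * b, j * b:(j + 1) * b]
            # Step (2): local multiply on process (i, j).
            partial = V_row @ K_ij
            # Assumed final step: sum partial results down process column j.
            E[:, j * b:(j + 1) * b] += partial
    return E


rng = np.random.default_rng(0)
n, k = 8, 3
labels = np.arange(n) % k
V = np.zeros((k, n))
V[labels, np.arange(n)] = 1.0  # one nonzero per column
A = rng.random((n, n))
K = A + A.T                    # symmetric stand-in for a kernel matrix
print(np.allclose(spmm_15d_simulated(V, K), V @ K))
```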

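Experimental Results

Research Questions

RQ1: How can Kernel K-means be scaled to million-scale datasets on multi-GPU clusters while still computing the exact solution?
RQ2: Which distributed-memory partitioning and communication strategies achieve load balance and minimize data movement across the GEMM, SpMM, and SpMV operations of Kernel K-means?
RQ3: On GPUs, does the 1.5D algorithm outperform the 1D and 2D variants under both weak and strong scaling?
RQ4: How do the proposed algorithms perform in practice on real-world datasets, and how do they compare with single-GPU approaches?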

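Key Findings

The 1.5D algorithm achieves the highest performance, enabling Kernel K-means to scale to data one to two orders of magnitude larger than previously practical.
On 256 GPUs, the 1.5D algorithm achieves a geometric mean weak scaling efficiency of 79.7% and a geometric mean strong scaling speedup of 4.2×.
Compared with the 1D baseline, the 1.5D algorithm delivers up to a 3.6× speedup under strong scaling, and it is more than 2000× faster than the single-GPU sliding-window approach.
The 1.5D and 2D algorithms handle more than 1.5 million points without running out of memory.
The implementation is open source as part of the Vivaldi package, and the project is available at the provided GitHub link.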
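
For readers unfamiliar with the scaling metrics quoted above, the snippet below shows how such geometric means are typically computed. The runtimes are invented purely for illustration and are not the paper's measurements; the baseline GPU count used in the paper is not stated in this summary, so the "baseline" column here is a placeholder. The function names are hypothetical.

```python
from statistics import geometric_mean


def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Problem size grows with GPU count, so the ideal runtime stays constant."""
    return t_base / t_scaled


def strong_scaling_speedup(t_base: float, t_scaled: float) -> float:
    """Problem size is fixed; speedup is baseline time over scaled time."""
    return t_base / t_scaled


# Hypothetical (baseline, scaled) runtimes in seconds for three datasets.
weak_runs = [(10.0, 12.5), (8.0, 10.0), (20.0, 25.0)]
strong_runs = [(40.0, 10.0), (90.0, 22.5), (16.0, 4.0)]

weak_gm = geometric_mean([weak_scaling_efficiency(a, b) for a, b in weak_runs])
strong_gm = geometric_mean([strong_scaling_speedup(a, b) for a, b in strong_runs])

print(f"{weak_gm:.1%}")     # 80.0%
print(f"{strong_gm:.1f}x")  # 4.0x
```

The geometric mean is the usual choice for averaging ratios such as efficiencies and speedups, because it treats a 2× gain on one dataset and a 2× loss on another as cancelling out.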


This review was generated by AI and reviewed by human editors.