QUICK REVIEW

[Paper Review] Communication-Avoiding Linear Algebraic Kernel K-Means on GPUs

Julian Bellavita, Matthew Rubino|arXiv (Cornell University)|Jan 23, 2026
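Advanced Clustering Algorithms Research | Cited by 0
One-Sentence Summary

The paper presents a distributed-memory, multi-GPU implementation of exact Kernel K-means, built on communication-avoiding linear algebra primitives, that handles million-scale datasets. Its core contribution is a novel 1.5D algorithm that fuses GEMM and SpMM to minimize communication.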

ABSTRACT

Clustering is an important tool in data analysis, with K-means being popular for its simplicity and versatility. However, it cannot handle non-linearly separable clusters. Kernel K-means addresses this limitation but requires a large kernel matrix, making it computationally and memory intensive. Prior work has accelerated Kernel K-means by formulating it using sparse linear algebra primitives and implementing it on a single GPU. However, that approach cannot run on datasets with more than approximately 80,000 samples due to limited GPU memory. In this work, we address this issue by presenting a suite of distributed-memory parallel algorithms for large-scale Kernel K-means clustering on multi-GPU systems. Our approach maps the most computationally expensive components of Kernel K-means onto communication-efficient distributed linear algebra primitives uniquely tailored for Kernel K-means, enabling highly scalable implementations that efficiently cluster million-scale datasets. Central to our work is the design of partitioning schemes that enable communication-efficient composition of the linear algebra primitives that appear in Kernel K-means. Our 1.5D algorithm consistently achieves the highest performance, enabling Kernel K-means to scale to data one to two orders of magnitude larger than previously practical. On 256 GPUs, it achieves a geometric mean weak scaling efficiency of $79.7\%$ and a geometric mean strong scaling speedup of $4.2\times$. Compared to our 1D algorithm, the 1.5D approach achieves up to a $3.6\times$ speedup on 256 GPUs and reduces clustering time from over an hour to under two seconds relative to a single-GPU sliding window implementation. Our results show that distributed algorithms designed with application-specific linear algebraic formulations can achieve substantial performance improvement.

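Motivation and Goals

  • Motivate scalable, exact Kernel K-means clustering for large datasets that exceed the memory of a single GPU.
  • Develop distributed-memory algorithms that map the components of Kernel K-means onto communication-efficient linear algebra primitives.
  • Explore partitioning and composition strategies that minimize inter-process communication while preserving load balance.
  • Demonstrate practical scalability and provide an open-source software implementation.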
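
The single-GPU memory limit mentioned in the abstract can be made concrete with a little arithmetic. The sketch below assumes the kernel matrix is stored densely in 32-bit floats and is split evenly across GPUs; both are assumptions made for illustration, not details confirmed by the paper, and the helper name `kernel_matrix_bytes` is hypothetical.

```python
def kernel_matrix_bytes(n_samples: int, bytes_per_entry: int = 4) -> int:
    """Bytes needed for a dense n x n kernel matrix (FP32 assumed by default)."""
    return n_samples * n_samples * bytes_per_entry


GB = 10**9  # decimal gigabytes

# Roughly the single-GPU limit quoted in the abstract.
single_gpu_gb = kernel_matrix_bytes(80_000) / GB

# A 1.5-million-point dataset, the scale reported in the findings.
full_gb = kernel_matrix_bytes(1_500_000) / GB

# An even split across 256 GPUs (an idealized assumption).
per_gpu_gb = full_gb / 256

print(f"80,000 samples: {single_gpu_gb:.1f} GB")
print(f"1,500,000 samples: {full_gb:.1f} GB total, {per_gpu_gb:.1f} GB per GPU")
```

Under these assumptions the kernel matrix alone grows quadratically with the number of samples, which is why distributing it over many GPUs is needed well before the dataset itself becomes large.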

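Proposed Method

  • Map Kernel K-means onto distributed GEMM and sparse-dense multiplication (SpMM/SpMV) on distributed-memory systems.
  • Propose four algorithms (1D, hybrid 1D, 1.5D, 2D) that use domain-specific partitioning to reduce communication.
  • Introduce a 1.5D algorithm that uses SUMMA for the GEMM and a 1D distribution of V for efficient SpMM and cluster updates.
  • Exploit the sparsity structure of V (one nonzero per column) to achieve load-balanced SpMM with minimal communication.
  • Provide an open-source GPU implementation (Vivaldi) and compare it against a single-GPU sliding-window baseline.
  • Analyze the communication cost for K and D^T under each algorithm and show that 1.5D achieves the best asymptotic efficiency.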
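
To make the role of V and the SpMM concrete, here is a minimal single-process sketch of one Kernel K-means iteration in the standard linear algebraic form. It assumes V has one nonzero per column with value 1/|cluster|, and that the distance of point j to cluster c is K[j][j] - 2(VK)[c][j] + (VKVᵀ)[c][c]. The function name `kernel_kmeans_step`, the Gaussian kernel, and the toy data are illustrative assumptions; this is not the Vivaldi implementation and the paper's exact scaling of V may differ.

```python
import math


def kernel_kmeans_step(K, assign, k):
    """One exact Kernel K-means update given kernel matrix K and labels."""
    n = len(K)
    sizes = [0] * k
    for c in assign:
        sizes[c] += 1
    # Sparse V (k x n): column i has a single nonzero at row assign[i].
    v_vals = [1.0 / sizes[c] for c in assign]

    # SpMM: E = V K. Each nonzero of V scales one row of K.
    E = [[0.0] * n for _ in range(k)]
    for i in range(n):
        row_E, row_K, w = E[assign[i]], K[i], v_vals[i]
        for j in range(n):
            row_E[j] += w * row_K[j]

    # z[c] = (V K V^T)[c][c], the squared norm of centroid c in feature space.
    z = [0.0] * k
    for i in range(n):
        z[assign[i]] += v_vals[i] * E[assign[i]][i]

    # Reassign each point to the nearest non-empty cluster.
    new_assign = []
    for j in range(n):
        best_c, best_d = assign[j], math.inf
        for c in range(k):
            if sizes[c] == 0:
                continue
            d = K[j][j] - 2.0 * E[c][j] + z[c]
            if d < best_d:
                best_c, best_d = c, d
        new_assign.append(best_c)
    return new_assign


# Toy 1D data with two well-separated groups and a Gaussian kernel.
xs = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
K = [[math.exp(-(a - b) ** 2) for b in xs] for a in xs]
labels = [0, 1, 0, 1, 0, 1]
for _ in range(5):
    labels = kernel_kmeans_step(K, labels, 2)
print(labels)
```

Because every column of V holds exactly one nonzero, the SpMM touches each row of K once, which is the property the bullet on load balance refers to.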
Figure 1 : 1.5D SpMM algorithm on $P=4$ processes. $\mathbf{V}$ is partitioned 1D columnwise and $\mathbf{K}$ in 2D. (1) The nonzeros of each $\mathbf{V}$ partition are replicated along the corresponding process row. (2) Each process performs a local SpMM with its $\mathbf{V}$ replicas and local $\m
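
The two steps named in the caption can be simulated in a single process. The sketch below assumes a 2×2 process grid, an even 1D columnwise split of V, and a final summation of partial results down each process column. That last step is an assumption, since the caption is cut off before it describes what follows the local SpMM. All function names are hypothetical, and the loops stand in for what would be collective communication on a real cluster.

```python
def spmm(assign, vals, K_rows, k):
    """E = V K, where column i of V has one nonzero vals[i] at row assign[i]."""
    width = len(K_rows[0])
    E = [[0.0] * width for _ in range(k)]
    for i, (c, w) in enumerate(zip(assign, vals)):
        for j in range(width):
            E[c][j] += w * K_rows[i][j]
    return E


def simulate_15d_spmm(K, assign, vals, k, q=2):
    """Simulate the caption's steps on a q x q grid (P = q * q processes)."""
    n = len(K)
    assert n % (q * q) == 0, "n must divide evenly in this sketch"
    block = n // q        # side length of each 2D block of K
    piece = n // (q * q)  # columns of V owned by each process
    E = [[0.0] * n for _ in range(k)]
    for r in range(q):
        # (1) Replicate V nonzeros along process row r (an all-gather).
        gathered = []
        for c in range(q):
            p = r * q + c
            gathered.extend(range(p * piece, (p + 1) * piece))
        v_rows = [assign[i] for i in gathered]
        v_vals = [vals[i] for i in gathered]
        for c in range(q):
            # (2) Local SpMM on process (r, c) with its block of K.
            K_rc = [K[i][c * block:(c + 1) * block] for i in gathered]
            partial = spmm(v_rows, v_vals, K_rc, k)
            # Assumed final step: sum partials along process column c.
            for a in range(k):
                for j in range(block):
                    E[a][c * block + j] += partial[a][j]
    return E


n, k = 8, 2
K = [[1.0 / (1 + abs(i - j)) for j in range(n)] for i in range(n)]
assign = [0, 0, 1, 1, 0, 1, 0, 1]
sizes = [assign.count(c) for c in range(k)]
vals = [1.0 / sizes[c] for c in assign]

E_serial = spmm(assign, vals, K, k)
E_dist = simulate_15d_spmm(K, assign, vals, k)
max_err = max(abs(E_serial[a][j] - E_dist[a][j])
              for a in range(k) for j in range(n))
print(max_err)
```

In this layout only the nonzeros of V move between processes, while the much larger blocks of K stay where they are; the simulation checks that the blocked result matches the serial product.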

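Experimental Results

Research Questions

  • RQ1: How can Kernel K-means be scaled to million-scale datasets on multi-GPU clusters while still computing the exact solution?
  • RQ2: Which distributed-memory partitioning and communication strategies achieve load balance and minimize data movement in the GEMM, SpMM, and SpMV operations of Kernel K-means?
  • RQ3: On GPUs, does the 1.5D scheme outperform the 1D and 2D variants under weak and strong scaling?
  • RQ4: How do the proposed algorithms perform in practice on real-world datasets, and how do they compare with single-GPU approaches?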

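Key Findings

  • The 1.5D algorithm achieves the highest performance, allowing Kernel K-means to scale to data one to two orders of magnitude larger than was previously practical.
  • On 256 GPUs, the 1.5D algorithm reaches a geometric mean weak scaling efficiency of 79.7% and a geometric mean strong scaling speedup of 4.2×.
  • Compared with the 1D baseline, 1.5D delivers up to a 3.6× speedup under strong scaling and is more than 2000× faster than the single-GPU sliding-window method.
  • The 1.5D and 2D algorithms can process more than 1.5 million points without running out of memory.
  • The implementation is open source in the Vivaldi package; the project is available at the GitHub link provided.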
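
For readers unfamiliar with the metrics quoted above, the following sketch shows how a geometric mean of weak scaling efficiencies and strong scaling speedups is typically computed. The timings are made-up placeholders, not measurements from the paper, and the baseline GPU count behind the paper's strong scaling figure is not stated in this review.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def weak_scaling_efficiency(t_base, t_scaled):
    """Work per GPU is fixed; the ideal is t_scaled == t_base (efficiency 1.0)."""
    return t_base / t_scaled


def strong_scaling_speedup(t_base, t_scaled):
    """Total work is fixed; speedup is how much faster the larger run finishes."""
    return t_base / t_scaled


# Hypothetical per-dataset runtimes in seconds (placeholders only).
weak_runs = [(10.0, 12.5), (12.0, 15.0)]   # (baseline, 256 GPUs)
strong_runs = [(8.0, 2.0), (9.0, 1.0)]     # (baseline, 256 GPUs)

weak_gm = geometric_mean([weak_scaling_efficiency(a, b) for a, b in weak_runs])
strong_gm = geometric_mean([strong_scaling_speedup(a, b) for a, b in strong_runs])
print(f"weak scaling efficiency (geo. mean): {weak_gm:.3f}")
print(f"strong scaling speedup (geo. mean): {strong_gm:.3f}")
```

The geometric mean is the usual choice for averaging ratios such as speedups, because it treats a 2× gain on one dataset and a 2× loss on another as cancelling out.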

This review was generated by AI and checked by a human editor.