QUICK REVIEW

[Paper Review] Incremental Graph Construction Enables Robust Spectral Clustering of Texts

Marko Pranjić, Boshko Koloski|arXiv (Cornell University)|Mar 3, 2026
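Advanced Graph Neural Networks | Cited by 0
One-Sentence Summary

The paper proposes an incremental k-NN graph construction that guarantees connectivity for any k, which makes spectral clustering of text embeddings robust at low k; it is validated on six datasets from the Massive Text Embedding Benchmark. The method outperforms standard k-NN in the sparse regime and matches it at larger k.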

ABSTRACT

Neighborhood graphs are a critical but often fragile step in spectral clustering of text embeddings. On realistic text datasets, standard $k$-NN graphs can contain many disconnected components at practical sparsity levels (small $k$), making spectral clustering degenerate and sensitive to hyperparameters. We introduce a simple incremental $k$-NN graph construction that preserves connectivity by design: each new node is linked to its $k$ nearest previously inserted nodes, which guarantees a connected graph for any $k$. We provide an inductive proof of connectedness and discuss implications for incremental updates when new documents arrive. We validate the approach on spectral clustering of SentenceTransformer embeddings using Laplacian eigenmaps across six clustering datasets from the Massive Text Embedding Benchmark. Compared to standard $k$-NN graphs, our method outperforms in the low-$k$ regime where disconnected components are prevalent, and matches standard $k$-NN at larger $k$.

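Motivation and Objectives

  • Show that standard k-NN graphs are fragile for spectral clustering of text embeddings because they can be disconnected.
  • Propose an incremental graph construction algorithm that guarantees connectivity regardless of k.
  • Evaluate how the incremental graph affects spectral clustering performance across multiple text datasets.
  • Assess the stability and robustness of the clustering results under different node orderings and embedding models.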

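Proposed Method

  • Proposes an incremental k-NN graph construction in which each new node is linked to its k nearest neighbors among the previously inserted nodes, which guarantees a connected graph.
  • Gives a formal inductive proof that the incremental graph is connected for any N and k.
  • Analyzes the potential computational savings by noting that the adjacency matrix changes only in a limited way as the graph grows incrementally.
  • Evaluates two affinity schemes for Laplacian Eigenmaps, connectivity-based and Gaussian-kernel-based, and justifies their use with k-NN graphs.
  • Clusters via Laplacian eigenmaps followed by a QR-based clustering method, and compares against standard k-NN graphs on multiple datasets.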
Figure 1: Proposed methodology. Top : General spectral clustering pipeline. After embedding the documents, a graph is constructed, projected into eigenspace, and clustered using k-means. Bottom : Comparison of a standard nearest-neighbor graph (which may be disconnected) with the proposed incrementa
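The construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `incremental_knn_graph` and `standard_knn_graph` are hypothetical, Euclidean distance is an assumption (the paper may use a different metric), rows of `X` are taken as the insertion order, and early nodes with fewer than k predecessors are assumed to link to all of them.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import connected_components


def incremental_knn_graph(X, k):
    """Symmetric 0/1 adjacency: node i links to its k nearest EARLIER nodes."""
    n = X.shape[0]
    A = lil_matrix((n, n), dtype=np.float64)
    for i in range(1, n):
        # Distances from the new node i to the previously inserted nodes 0..i-1.
        d = np.linalg.norm(X[:i] - X[i], axis=1)
        m = min(k, i)  # assumption: link to all predecessors when fewer than k exist
        for j in np.argsort(d)[:m]:
            A[i, j] = 1.0
            A[j, i] = 1.0
    return A.tocsr()


def standard_knn_graph(X, k):
    """Symmetric 0/1 adjacency: node i links to its k nearest nodes overall."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)  # a node is not its own neighbor
    A = lil_matrix((n, n), dtype=np.float64)
    for i in range(n):
        for j in np.argsort(D[i])[:k]:
            # Symmetrize by union (an assumption of this sketch).
            A[i, j] = 1.0
            A[j, i] = 1.0
    return A.tocsr()


# Toy data: two well-separated blobs, inserted in random order.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(100, 1, (20, 5))])
X_demo = X_demo[rng.permutation(len(X_demo))]

for name, build in [("incremental", incremental_knn_graph),
                    ("standard", standard_knn_graph)]:
    n_comp, _ = connected_components(build(X_demo, 2), directed=False)
    print(name, "k-NN components:", n_comp)
```

On data like this, the standard graph at small k keeps each blob to itself, while the incremental graph must bridge the blobs the first time a node arrives whose predecessors all lie in the other blob.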

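Experimental Results

Research Questions

  • RQ1: Does the incremental graph construction guarantee global connectivity for any k and N?
  • RQ2: How does using the incrementally connected k-NN graph, compared with the standard k-NN graph, affect the quality of spectral clustering of text embeddings, especially at low k?
  • RQ3: How sensitive is the method to node ordering and to changes in the embedding model?
  • RQ4: In the low-dimensional spectral embedding space, can the incremental graph approach come close to the performance of high-dimensional clustering methods?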
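RQ1 admits a short inductive argument. The following is a sketch under stated assumptions, not a reproduction of the paper's proof: the notation $G_n$ and $v_n$ is introduced here for illustration, and each inserted node is assumed to link to $\min(k, n)$ earlier nodes when $n$ nodes already exist.

```latex
\textbf{Claim.} Let $k \ge 1$ and let $G_n$ denote the graph after $n$ nodes
$v_1, \dots, v_n$ have been inserted. Then $G_n$ is connected for every $n \ge 1$.

\textbf{Base case.} $G_1$ consists of the single node $v_1$ and is trivially connected.

\textbf{Inductive step.} Assume $G_n$ is connected. The new node $v_{n+1}$ is linked
to $\min(k, n) \ge 1$ nodes of $G_n$, so there is an edge $\{v_{n+1}, u\}$ for some
$u \in G_n$. By the induction hypothesis every $w \in G_n$ is joined to $u$ by a path
inside $G_n$; appending the edge $\{u, v_{n+1}\}$ gives a path from $w$ to $v_{n+1}$.
Insertion only adds edges, so all paths of $G_n$ survive in $G_{n+1}$.
Hence $G_{n+1}$ is connected. \qed
```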

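Key Findings

  • The incremental k-NN graph guarantees a single connected component for any number of nodes, which makes spectral clustering more reliable.
  • In the low-k regime, where standard k-NN graphs are often disconnected, the incremental method shows more stable clustering performance across multiple datasets.
  • As k increases, the performance of the incremental method approaches that of the standard k-NN graph and matches it at larger k.
  • Across multiple datasets and embedding models, the method is robust to node ordering, with low variance in the clustering results.
  • The low-dimensional spectral embeddings obtained from the incremental graph give clustering results that are competitive with a high-dimensional K-means baseline.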
Figure 2: Clustering performance of embeddings induced using incremental $k$ -NN (Ours) and standard $k$ -NN for a range of parameter $k$ . Points, where the standard $k$ -NN neighborhood induces connected graphs, are marked, while in Ours, the graph is always connected. Results for sentence-to-sent
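Why connectivity matters for the spectral step can be illustrated with a small example: the multiplicity of the zero eigenvalue of the graph Laplacian equals the number of connected components, so a disconnected graph yields leading eigenvectors that only indicate components. The sketch below assumes the symmetric normalized Laplacian and a Gaussian bandwidth `sigma`; the helper names are hypothetical and the paper's exact Laplacian variant and kernel settings may differ.

```python
import numpy as np


def gaussian_affinity(A, X, sigma=1.0):
    """Reweight the existing edges of a 0/1 adjacency A with a Gaussian kernel."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return A * np.exp(-sq / (2.0 * sigma ** 2))  # sigma is an assumed bandwidth


def laplacian_spectrum(W):
    """Eigenpairs of the symmetric normalized Laplacian, eigenvalues ascending."""
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(deg)  # assumes every node has at least one edge
    L = np.eye(len(W)) - inv_sqrt[:, None] * W * inv_sqrt[None, :]
    return np.linalg.eigh(L)


def count_zero_eigenvalues(W, tol=1e-8):
    """Number of (numerically) zero Laplacian eigenvalues."""
    vals, _ = laplacian_spectrum(W)
    return int((vals < tol).sum())


def spectral_embedding(W, dim):
    """Laplacian-eigenmaps coordinates: skip the first (trivial) eigenvector."""
    _, vecs = laplacian_spectrum(W)
    return vecs[:, 1:dim + 1]


# Two disjoint triangles, then the same graph with one bridging edge.
tri = np.ones((3, 3)) - np.eye(3)
zero = np.zeros((3, 3))
A_split = np.block([[tri, zero], [zero, tri]])
A_joined = A_split.copy()
A_joined[2, 3] = A_joined[3, 2] = 1.0

print("zero eigenvalues, disconnected:", count_zero_eigenvalues(A_split))
print("zero eigenvalues, connected:   ", count_zero_eigenvalues(A_joined))
```

Either the 0/1 adjacency or the output of `gaussian_affinity` can be passed to `spectral_embedding`; the resulting coordinates would then go to whichever clustering step is used downstream.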


This review was generated by AI and reviewed by human editors.