QUICK REVIEW

[Paper Review] Text Clustering with Large Language Model Embeddings

Alina Petukhova, João P. Matos-Carvalho|arXiv (Cornell University)|Mar 22, 2024
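Natural Language Processing Techniques | Cited by 9
One-Sentence Summary

This paper evaluates how embeddings from traditional methods and from large language models (LLMs) perform for text clustering across multiple datasets and clustering algorithms, and analyses the effect of dimensionality reduction through summarisation and of embedding size.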

ABSTRACT

Text clustering is an important method for organising the increasing volume of digital content, aiding in the structuring and discovery of hidden patterns in uncategorised data. The effectiveness of text clustering largely depends on the selection of textual embeddings and clustering algorithms. This study argues that recent advancements in large language models (LLMs) have the potential to enhance this task. The research investigates how different textual embeddings, particularly those utilised in LLMs, and various clustering algorithms influence the clustering of text datasets. A series of experiments were conducted to evaluate the impact of embeddings on clustering results, the role of dimensionality reduction through summarisation, and the adjustment of model size. The findings indicate that LLM embeddings are superior at capturing subtleties in structured language. OpenAI's GPT-3.5 Turbo model yields better results in three out of five clustering metrics across most tested datasets. Most LLM embeddings show improvements in cluster purity and provide a more informative silhouette score, reflecting a refined structural understanding of text data compared to traditional methods. Among the more lightweight models, BERT demonstrates leading performance. Additionally, it was observed that increasing model dimensionality and employing summarisation techniques do not consistently enhance clustering efficiency, suggesting that these strategies require careful consideration for practical application. These results highlight a complex balance between the need for refined text representation and computational feasibility in text clustering applications. This study extends traditional text clustering frameworks by integrating embeddings from LLMs, offering improved methodologies and suggesting new avenues for future research in various types of textual analysis.

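Motivation and Objectives

  • Assess how different text embeddings affect clustering quality.
  • Assess the role of summarisation-based dimensionality reduction in clustering performance.
  • Study how embedding size affects clustering results and the computational trade-offs involved.
  • Compare open-source embeddings with OpenAI embeddings for clustering structured text.
  • Provide guidance on embedding selection for practical clustering applications.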

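Proposed Method

  • Compare results across multiple datasets (CSTR, SyskillWebert, 20Newsgroups, MN-DS) and clustering algorithms (K-means, K-means++, AHC, Fuzzy C-means, Spectral).
  • Use TF-IDF as the baseline and include BERT, Falcon, and LLaMA-2 embeddings from Hugging Face as well as OpenAI embeddings.
  • Preprocess the data (remove metadata, HTML, and non-Latin characters).
  • Evaluate with external metrics (F1-score, F1S; Adjusted Rand Index, ARI; homogeneity score, HS) and internal metrics (silhouette score, SS; Calinski-Harabasz index, CHI).
  • Run summarisation experiments with several models; analyse larger embedding models; visualise with PCA and t-SNE.
  • Examine how embedding size (within the Falcon and LLaMA-2 families) affects clustering performance.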
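
The preprocessing step listed above (removing metadata, HTML, and non-Latin characters) might look roughly like the following minimal sketch. The function name `clean_document`, the assumption that metadata appears as a leading run of email-style `Key: value` header lines, and the exact character ranges that are kept are all illustrative choices, not details taken from the paper.

```python
import html
import re

# Hypothetical cleaning rules; the paper's exact rules are not given in this review.
HEADER_LINE = re.compile(r"^[A-Za-z][A-Za-z0-9-]*:\s")  # e.g. "From: ...", "Subject: ..."
HTML_TAG = re.compile(r"<[^>]+>")
# Keep printable ASCII plus the Latin-1 / Latin Extended-A and -B range (roughly, accented letters).
NON_LATIN = re.compile(r"[^\x20-\x7E\u00C0-\u024F]")
WHITESPACE = re.compile(r"\s+")


def clean_document(raw: str) -> str:
    """Strip leading header-style metadata, HTML markup, and non-Latin characters."""
    lines = raw.splitlines()
    # Assumption: metadata is a leading run of "Key: value" lines.
    start = 0
    while start < len(lines) and HEADER_LINE.match(lines[start]):
        start += 1
    text = "\n".join(lines[start:])
    text = HTML_TAG.sub(" ", text)    # remove tags first
    text = html.unescape(text)        # then decode entities such as &amp;
    text = NON_LATIN.sub(" ", text)   # newlines and non-Latin characters become spaces
    return WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    sample = "From: a@b.com\nSubject: hi\n\n<p>Caf\u00e9 &amp; bar \u4e2d\u6587 text</p>"
    print(clean_document(sample))
```

Replacing unwanted characters with spaces rather than deleting them keeps neighbouring words from being fused together before tokenisation or embedding.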
(a) LLaMA-2-7b-chat-hf.

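Experimental Results

Research Questions

  • RQ1: Which embeddings achieve the highest external clustering quality (F1S, ARI, HS) across datasets?
  • RQ2: Does summarisation-based dimensionality reduction yield consistent improvements in clustering results?
  • RQ3: How does embedding size affect clustering performance, and which models benefit most?
  • RQ4: How do open-source and OpenAI embeddings differ in performance when clustering structured text?
  • RQ5: What is the practical trade-off between clustering quality and computational cost when using larger LLM embeddings?

Key Findings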

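| Dataset | Embedding | Best algorithm | F1S | ARI | HS | SS | CHI | Total |
|---|---|---|---|---|---|---|---|---|
| DS1 | TF-IDF | k-means | 0.67 | 0.38 | 0.46 | 0.016 | 4 | 0/5 |
| DS1 | BERT | Spectral | 0.85 | 0.60 | 0.63 | 0.118 | 25 | 3/5 |
| DS1 | OpenAI | k-means | 0.84 | 0.59 | 0.64 | 0.066 | 13 | 1/5 |
| DS1 | LLaMA-2 | k-means | 0.41 | 0.09 | 0.17 | 0.112 | 49 | 1/5 |
| DS1 | Falcon | k-means | 0.74 | 0.39 | 0.48 | 0.111 | 34 | 0/5 |
| DS2 | TF-IDF | Spectral | 0.82 | 0.63 | 0.58 | 0.028 | 8 | 0/5 |
| DS2 | BERT | AHC | 0.74 | 0.58 | 0.53 | 0.152 | 37 | 0/5 |
| DS2 | OpenAI | AHC | 0.90 | 0.79 | 0.75 | 0.070 | 19 | 3/5 |
| DS2 | LLaMA-2 | k-means | 0.51 | 0.21 | 0.25 | 0.137 | 69 | 0/5 |
| DS2 | Falcon | k-means++ | 0.45 | 0.26 | 0.30 | 0.170 | 85 | 2/5 |
| DS3 | TF-IDF | Spectral | 0.35 | 0.13 | 0.28 | -0.002 | 37 | 0/5 |
| DS3 | BERT | k-means | 0.43 | 0.25 | 0.44 | 0.048 | 412 | 0/5 |
| DS3 | OpenAI | k-means | 0.69 | 0.52 | 0.66 | 0.035 | 213 | 3/5 |
| DS3 | LLaMA-2 | AHC | 0.17 | 0.11 | 0.26 | 0.025 | 264 | 0/5 |
| DS3 | Falcon | k-means | 0.26 | 0.15 | 0.30 | 0.071 | 1120 | 2/5 |
| DS4 | TF-IDF | k-means | 0.29 | 0.13 | 0.48 | 0.034 | 17 | 0/5 |
| DS4 | BERT | k-means | 0.35 | 0.24 | 0.55 | 0.072 | 61 | 1/5 |
| DS4 | OpenAI | k-means | 0.38 | 0.26 | 0.58 | 0.053 | 42 | 3/5 |
| DS4 | LLaMA-2 | k-means | 0.21 | 0.11 | 0.40 | 0.053 | 88 | 0/5 |
| DS4 | Falcon | k-means++ | 0.27 | 0.16 | 0.48 | 0.071 | 92 | 1/5 |
| DS5 | TF-IDF | AHC | 0.31 | 0.09 | 0.29 | 0.010 | 37 | 0/5 |
| DS5 | BERT | k-means++ | 0.43 | 0.27 | 0.42 | 0.060 | 178 | 2/5 |
| DS5 | OpenAI | Spectral | 0.45 | 0.25 | 0.41 | 0.036 | 120 | 1/5 |
| DS5 | LLaMA-2 | AHC | 0.23 | 0.10 | 0.23 | 0.031 | 263 | 0/5 |
| DS5 | Falcon | k-means++ | 0.28 | 0.12 | 0.25 | 0.070 | 359 | 2/5 |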
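
The Total column can be re-derived from the per-metric values. The sketch below does this for the five DS1 rows, reading F1S, ARI, and HS to two decimals and SS to three. It assumes that Total counts the metrics on which an embedding has the highest value within its dataset; that rule is inferred from the numbers rather than stated in the review, and the function name `count_best_metrics` is made up for illustration.

```python
# DS1 rows from the results table: (embedding, F1S, ARI, HS, SS, CHI)
DS1_ROWS = [
    ("TF-IDF", 0.67, 0.38, 0.46, 0.016, 4),
    ("BERT", 0.85, 0.60, 0.63, 0.118, 25),
    ("OpenAI", 0.84, 0.59, 0.64, 0.066, 13),
    ("LLaMA-2", 0.41, 0.09, 0.17, 0.112, 49),
    ("Falcon", 0.74, 0.39, 0.48, 0.111, 34),
]


def count_best_metrics(rows):
    """Return {embedding: number of metrics on which it has the highest value}."""
    totals = {row[0]: 0 for row in rows}
    n_metrics = len(rows[0]) - 1
    for m in range(1, n_metrics + 1):
        best = max(row[m] for row in rows)
        for row in rows:
            if row[m] == best:  # assumed tie rule: every tied embedding is credited
                totals[row[0]] += 1
    return totals


if __name__ == "__main__":
    for name, total in count_best_metrics(DS1_ROWS).items():
        print(f"{name}: {total}/5")
```

For DS1 this gives 3/5 for BERT (F1S, ARI, SS), 1/5 for OpenAI (HS), 1/5 for LLaMA-2 (CHI), and 0/5 for TF-IDF and Falcon, which matches the Total values in the table.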
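  • OpenAI embeddings generally achieve better clustering performance on several metrics for structured, formal text.
  • k-means with OpenAI embeddings tends to reach high ARI, F1S, and HS, but its silhouette score and CHI can be lower, suggesting possible effects of cluster geometry or shape in the embedding space.
  • Open-source embeddings (Falcon, LLaMA-2) give mixed results; BERT usually performs well among the open-source options, and Falcon-7b outperforms LLaMA-2-7b in several cases.
  • Summarisation does not consistently improve clustering across models; some models, particularly smaller ones, lose performance because of information loss.
  • Larger embedding size can improve clustering for some models (for example, moving from Falcon-7b to Falcon-40b), but the gain is not universal, and larger embeddings carry higher computational cost.
  • Dimensionality-reduction visualisations (PCA/t-SNE) show better class separation for some larger models (e.g. LLaMA-13b, Falcon-7b) than for smaller ones.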
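
The findings above contrast external metrics (ARI, HS), which need ground-truth labels, with internal metrics (SS, CHI), which use only the embedding geometry. The following is a minimal sketch of computing both kinds for one embedding and algorithm pair with scikit-learn. Synthetic blobs stand in for document embeddings, so none of the paper's data or models are involved, and `evaluate_clustering` is a hypothetical helper name. F1S is left out because it needs a cluster-to-class mapping that this review does not describe.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_rand_score,
    calinski_harabasz_score,
    homogeneity_score,
    silhouette_score,
)


def evaluate_clustering(embeddings, true_labels, n_clusters, seed=0):
    """Cluster with k-means++ initialisation and report external and internal metrics."""
    model = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10, random_state=seed)
    predicted = model.fit_predict(embeddings)
    return {
        # External: compare predicted clusters with known classes.
        "ARI": adjusted_rand_score(true_labels, predicted),
        "HS": homogeneity_score(true_labels, predicted),
        # Internal: judge compactness and separation from the vectors alone.
        "SS": silhouette_score(embeddings, predicted),
        "CHI": calinski_harabasz_score(embeddings, predicted),
    }


if __name__ == "__main__":
    # Stand-in data: 300 points in 32 dimensions drawn around 4 centres.
    X, y = make_blobs(n_samples=300, centers=4, n_features=32, random_state=0)
    for name, value in evaluate_clustering(X, y, n_clusters=4).items():
        print(f"{name}: {value:.3f}")
```

Because the two metric families measure different things, a clustering can agree closely with the labels while still scoring modestly on silhouette or CHI, which is the pattern reported for OpenAI embeddings.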
(b) LLaMA-2-13b-chat-hf.


This review was generated by AI and reviewed by a human editor.