QUICK REVIEW

[Paper Review] Text Clustering with Large Language Model Embeddings

Alina Petukhova, João P. Matos-Carvalho | arXiv (Cornell University) | Mar 22, 2024

Natural Language Processing Techniques | Cited by 9

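One-Sentence Summary

This paper evaluates how embeddings from traditional methods and from large language models (LLMs) perform for text clustering across multiple datasets and clustering algorithms, and analyses the effect of dimensionality reduction through summarisation and of embedding size.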

ABSTRACT

Text clustering is an important method for organising the increasing volume of digital content, aiding in the structuring and discovery of hidden patterns in uncategorised data. The effectiveness of text clustering largely depends on the selection of textual embeddings and clustering algorithms. This study argues that recent advancements in large language models (LLMs) have the potential to enhance this task. The research investigates how different textual embeddings, particularly those utilised in LLMs, and various clustering algorithms influence the clustering of text datasets. A series of experiments were conducted to evaluate the impact of embeddings on clustering results, the role of dimensionality reduction through summarisation, and the adjustment of model size. The findings indicate that LLM embeddings are superior at capturing subtleties in structured language. OpenAI's GPT-3.5 Turbo model yields better results in three out of five clustering metrics across most tested datasets. Most LLM embeddings show improvements in cluster purity and provide a more informative silhouette score, reflecting a refined structural understanding of text data compared to traditional methods. Among the more lightweight models, BERT demonstrates leading performance. Additionally, it was observed that increasing model dimensionality and employing summarisation techniques do not consistently enhance clustering efficiency, suggesting that these strategies require careful consideration for practical application. These results highlight a complex balance between the need for refined text representation and computational feasibility in text clustering applications. This study extends traditional text clustering frameworks by integrating embeddings from LLMs, offering improved methodologies and suggesting new avenues for future research in various types of textual analysis.

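Research Motivation and Objectives

Assess how different text embeddings affect clustering quality.
Assess the role of dimensionality reduction through summarisation in clustering performance.
Study how embedding size affects clustering results and the associated computational trade-offs.
Compare open-source embeddings with OpenAI embeddings for clustering structured text.
Provide guidance on choosing embeddings for practical clustering applications.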

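Proposed Method

Comparisons are run across multiple datasets (CSTR, SyskillWebert, 20Newsgroups, MN-DS) and clustering algorithms (K-means, K-means++, AHC, Fuzzy C-means, Spectral).
TF-IDF serves as the baseline; the comparison also includes BERT, OpenAI, Falcon, and LLaMA-2 embeddings, with the open models taken from Hugging Face.
The data is preprocessed by removing metadata, HTML, and non-Latin characters.
Evaluation uses external metrics (F1S, ARI, HS) and internal metrics (SS, CHI).
Summarisation experiments are run with several models, larger embedding models are analysed, and PCA and t-SNE are used for visualisation.
The effect of embedding size on clustering performance is examined within the Falcon and LLaMA-2 families.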
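The external metrics listed above compare predicted clusters against ground-truth labels. The following is a minimal sketch of two of them, assuming that ARI stands for the Adjusted Rand Index and HS for the homogeneity score (the summary gives only the abbreviations). The function names are illustrative and are not taken from the paper; a library such as scikit-learn would normally be used instead.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from pair counts (assumed meaning of ARI)."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: treat as trivially perfect
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    # Number of sample pairs that share a cell, a true class, or a predicted cluster.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single class and a single cluster
    return (sum_cells - expected) / (max_index - expected)


def homogeneity(labels_true, labels_pred):
    """Homogeneity: 1 - H(class | cluster) / H(class) (assumed meaning of HS)."""
    n = len(labels_true)
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    class_entropy = -sum(c / n * log(c / n) for c in true_counts.values())
    if class_entropy == 0:
        return 1.0  # only one true class: every cluster is trivially pure
    cond_entropy = -sum(
        c / n * log(c / pred_counts[cluster])
        for (_, cluster), c in cell_counts.items()
    )
    return 1.0 - cond_entropy / class_entropy
```

Both functions depend only on how the labels partition the samples, so renaming the cluster ids (for example swapping 0 and 1) leaves the scores unchanged. This is why they suit clustering output, where the ids are arbitrary.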

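Experimental Results

Research Questions

RQ1: Which embeddings achieve the highest external clustering quality (F1S, ARI, HS) across datasets?
RQ2: Does dimensionality reduction through summarisation yield consistent improvements in clustering results?
RQ3: How does embedding size affect clustering performance, and which models benefit most?
RQ4: How do open-source embeddings and OpenAI embeddings differ when clustering structured text?
RQ5: What is the practical trade-off between clustering quality and computational cost when using larger LLM embeddings?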

Key Findings

Dataset	Embedding	Best algorithm	F1S	ARI	HS	SS	CHI	Total
DS1	TF-IDF	k-means	0.67	0.38	0.46	0.016	4	0/5
DS1	BERT	Spectral	0.85	0.60	0.63	0.118	25	3/5
DS1	OpenAI	k-means	0.84	0.59	0.64	0.066	13	1/5
DS1	LLaMA-2	k-means	0.41	0.09	0.17	0.112	49	1/5
DS1	Falcon	k-means	0.74	0.39	0.48	0.111	34	0/5
DS2	TF-IDF	Spectral	0.82	0.63	0.58	0.028	8	0/5
DS2	BERT	AHC	0.74	0.58	0.53	0.152	37	0/5
DS2	OpenAI	AHC	0.90	0.79	0.75	0.070	19	3/5
DS2	LLaMA-2	k-means	0.51	0.21	0.25	0.137	69	0/5
DS2	Falcon	k-means++	0.45	0.26	0.30	0.170	85	2/5
DS3	TF-IDF	Spectral	0.35	0.13	0.28	-0.002	37	0/5
DS3	BERT	k-means	0.43	0.25	0.44	0.048	412	0/5
DS3	OpenAI	k-means	0.69	0.52	0.66	0.035	213	3/5
DS3	LLaMA-2	AHC	0.17	0.11	0.26	0.025	264	0/5
DS3	Falcon	k-means	0.26	0.15	0.30	0.071	1120	2/5
DS4	TF-IDF	k-means	0.29	0.13	0.48	0.034	17	0/5
DS4	BERT	k-means	0.35	0.24	0.55	0.072	61	1/5
DS4	OpenAI	k-means	0.38	0.26	0.58	0.053	42	3/5
DS4	LLaMA-2	k-means	0.21	0.11	0.40	0.053	88	0/5
DS4	Falcon	k-means++	0.27	0.16	0.48	0.071	92	1/5
DS5	TF-IDF	AHC	0.31	0.09	0.29	0.010	37	0/5
DS5	BERT	k-means++	0.43	0.27	0.42	0.060	178	2/5
DS5	OpenAI	Spectral	0.45	0.25	0.41	0.036	120	1/5
DS5	LLaMA-2	AHC	0.23	0.10	0.23	0.031	263	0/5
DS5	Falcon	k-means++	0.28	0.12	0.25	0.070	359	2/5
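The last column of the table appears to count, for each dataset, the number of the five metrics on which an embedding scores highest. The sketch below checks that reading against the DS1 rows. This interpretation is an assumption, as are the names `DS1_ROWS` and `count_metric_wins`; ties, which do not occur in DS1, would go to the first row listed.

```python
METRICS = ("F1S", "ARI", "HS", "SS", "CHI")

# DS1 rows copied from the table above: (F1S, ARI, HS, SS, CHI) per embedding.
DS1_ROWS = {
    "TF-IDF": (0.67, 0.38, 0.46, 0.016, 4),
    "BERT": (0.85, 0.60, 0.63, 0.118, 25),
    "OpenAI": (0.84, 0.59, 0.64, 0.066, 13),
    "LLaMA-2": (0.41, 0.09, 0.17, 0.112, 49),
    "Falcon": (0.74, 0.39, 0.48, 0.111, 34),
}


def count_metric_wins(rows):
    """Count, per embedding, the metrics on which it has the highest value."""
    wins = {name: 0 for name in rows}
    for i in range(len(METRICS)):
        # Higher is better for all five metrics, so the best row is the maximum.
        best = max(rows, key=lambda name: rows[name][i])
        wins[best] += 1
    return wins


ds1_wins = count_metric_wins(DS1_ROWS)
# ds1_wins == {"TF-IDF": 0, "BERT": 3, "OpenAI": 1, "LLaMA-2": 1, "Falcon": 0}
```

The counts reproduce the 0/5, 3/5, 1/5, 1/5, 0/5 entries listed for DS1, which supports reading the column as a per-dataset tally of metric wins.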

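OpenAI embeddings generally achieve better clustering performance on several metrics for structured, formal text.
k-means with OpenAI embeddings tends to reach high ARI, F1S, and HS, but its Silhouette and CHI values can be lower, which may point to effects of the embedding space's geometry or of cluster shape.
Open-source embeddings (Falcon, LLaMA-2) give mixed results; BERT usually performs well among the open-source options, and Falcon-7b outperforms LLaMA-2-7b in several cases.
Summarisation does not consistently improve clustering across models; some models, especially smaller ones, lose performance because of information loss.
Increasing embedding size can improve clustering for some models (for example, from Falcon-7b to Falcon-40b), but the gain is not universal, and larger embeddings carry a higher computational cost.
Dimensionality-reduction visualisations (PCA/t-SNE) show better class separation for some larger models (such as LLaMA-13b and Falcon-7b) than for smaller ones.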
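The gap between strong external scores and weaker Silhouette values is easier to reason about with the internal metric written out. This is a minimal sketch, assuming SS denotes the standard silhouette coefficient with Euclidean distance (the summary does not state the distance used). The function name and the toy points are illustrative only.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, so divide by len(own) - 1).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy 2-D points standing in for embeddings: two tight, well-separated groups.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
well_separated = silhouette_score(toy_points, [0, 0, 1, 1])  # close to 1
mixed_up = silhouette_score(toy_points, [0, 1, 0, 1])        # negative
```

Because the score depends only on distances in the embedding space, a labelling that matches the ground truth can still score low when the clusters are elongated or close together in that space. That is one possible reading of the lower Silhouette and CHI values reported for OpenAI embeddings.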

This review was generated by AI and reviewed by a human editor.