QUICK REVIEW

[Paper Review] TopicGPT: A Prompt-based Topic Modeling Framework

Chau Pham, Alexander Hoyle|arXiv (Cornell University)|Nov 2, 2023

Topic Modeling | Cited by 11

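ONE-SENTENCE SUMMARY

TopicGPT prompts large language models to generate interpretable topics from a text corpus and to assign documents to them. Its topics align more closely with human-annotated ground truth than those of LDA and BERTopic, and it supports topic refinement and hierarchical expansion without retraining.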

ABSTRACT

Topic modeling is a well-established technique for exploring text corpora. Conventional topic models (e.g., LDA) represent topics as bags of words that often require "reading the tea leaves" to interpret; additionally, they offer users minimal control over the formatting and specificity of resulting topics. To tackle these issues, we introduce TopicGPT, a prompt-based framework that uses large language models (LLMs) to uncover latent topics in a text collection. TopicGPT produces topics that align better with human categorizations compared to competing methods: it achieves a harmonic mean purity of 0.74 against human-annotated Wikipedia topics compared to 0.64 for the strongest baseline. Its topics are also interpretable, dispensing with ambiguous bags of words in favor of topics with natural language labels and associated free-form descriptions. Moreover, the framework is highly adaptable, allowing users to specify constraints and modify topics without the need for model retraining. By streamlining access to high-quality and interpretable topics, TopicGPT represents a compelling, human-centered approach to topic modeling.

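MOTIVATION AND GOALS

Develop a human-centered topic modeling framework that produces interpretable topics with natural-language labels and descriptions.
Use iterative LLM prompting to generate and refine topics from a sample of documents, without retraining.
Provide document-topic assignments with supporting quotes so that results are easier to verify.
Enable hierarchical expansion, when needed, to explore subtopics and finer-grained topics.
Evaluate the robustness and stability of the topic outputs and their alignment with human-annotated ground truth.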

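PROPOSED METHOD

Generate topics by iteratively prompting an LLM with sampled documents and the existing seed topics.
Refine the topics using embeddings and a frequency threshold: merge near-duplicate topics and remove infrequent ones.
Optionally expand into a topic hierarchy by prompting for subtopics grounded in the relevant documents.
Assign documents to topics with an LLM prompt that returns the topic label, a description, and a supporting quote.
Add a self-correction step to fix hallucinations and formatting problems in topic assignments.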
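
The refinement step can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function name `refine_topics`, the toy two-dimensional vectors, and the `sim_threshold` and `min_count` values are hypothetical, and the merge decision is reduced to a plain cosine-similarity cutoff, whereas the framework itself may involve the LLM in deciding what to merge.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def refine_topics(embeddings, generated, sim_threshold=0.9, min_count=2):
    """Merge near-duplicate topics, then drop infrequent ones.

    embeddings: dict mapping topic label -> embedding vector (hypothetical input)
    generated:  list of topic labels, one per processed document
    Returns (surviving topic labels, mapping from each label to its merge target).
    """
    counts = Counter(generated)
    # Visit frequent topics first so the more common label survives a merge.
    ordered = sorted(embeddings, key=lambda t: (-counts[t], t))

    kept, merged_into = [], {}
    for topic in ordered:
        target = next(
            (k for k in kept
             if cosine(embeddings[topic], embeddings[k]) >= sim_threshold),
            None,
        )
        if target is None:
            kept.append(topic)
            merged_into[topic] = topic
        else:
            merged_into[topic] = target

    # Recount with merged labels, then apply the frequency threshold.
    merged_counts = Counter(merged_into[t] for t in generated if t in merged_into)
    final = [t for t in kept if merged_counts[t] >= min_count]
    return final, merged_into


# Toy data: "Commerce" is nearly parallel to "Trade"; "Sports" appears only once.
toy_embeddings = {
    "Trade": [1.0, 0.0],
    "Commerce": [0.99, 0.1],
    "Agriculture": [0.0, 1.0],
    "Sports": [-1.0, 0.0],
}
toy_generated = ["Trade", "Trade", "Commerce", "Agriculture", "Agriculture", "Sports"]
final_topics, merge_map = refine_topics(toy_embeddings, toy_generated)
print(final_topics, merge_map["Commerce"])
```

On the toy data, "Commerce" is folded into "Trade" and the single-occurrence "Sports" topic is dropped. In practice the vectors would come from a sentence-embedding model, and both thresholds would need tuning for the corpus.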

Figure 1: Overview of TopicGPT. 1) Topic Generation : Given a corpus and a list of manually-curated seed topics, TopicGPT identifies topics within each corpus document. The framework then refines the topic list by merging repeated topics and removing infrequent topics. 2) Topic Assignment : Given th

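EXPERIMENTAL RESULTS

Research Questions

RQ1: Do the topics generated by TopicGPT align more closely with human-annotated ground-truth topics than those of LDA and BERTopic across datasets?
RQ2: How robust are TopicGPT's outputs to prompt variations, sample selection, and the choice between open-source and closed-source LLMs?
RQ3: Can TopicGPT reliably produce interpretable topics with natural-language labels, descriptions, and supporting quotes that justify the assignments?
RQ4: Does extending TopicGPT into a hierarchy improve topic granularity and usefulness?
RQ5: How do the quality and number of seed topics affect topic consistency and alignment?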

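Key Findings

TopicGPT aligns more closely with ground-truth labels than LDA and BERTopic on the Wiki and Bills datasets. Under default settings, Wiki: P1=0.73, ARI=0.58, NMI=0.71; Bills: P1=0.57, ARI=0.42, NMI=0.47, where P1 is harmonic mean purity, ARI is the adjusted Rand index, and NMI is normalized mutual information.
Refinement improves interpretability and reduces inconsistent topics (after refinement, Wiki: P1=0.74, ARI=0.60; Bills: P1=0.57, ARI=0.40).
TopicGPT's assignments remain stable across prompts and data samples, with alignment metrics comparable to or better than LDA's; two runs of the pipeline agree closely with each other (P1=0.95, ARI=0.92, NMI=0.92).
Open-source LLMs can handle topic assignment (for example, Mistral-7B-Instruct performs reasonably well at assignment), but they struggle with topic generation compared with GPT-4.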
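In both its unrefined and refined forms, TopicGPT produces topics that are semantically closer to the ground-truth topics than LDA's, and refinement markedly reduces misaligned topics.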
Hierarchical TopicGPT can generate informative subtopics grounded in a parent topic and its documents, enabling richer analysis.
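
To make the P1 figures above easier to read, here is a minimal sketch of one common way to compute harmonic mean purity: the harmonic mean of purity and inverse purity over predicted topic assignments and gold labels. This is an assumption about the metric's form, and the paper's exact variant (for instance, how clusters are matched to categories) may differ. The function names and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def purity(pred, gold):
    """Share of documents that belong to the majority gold class of their cluster.

    pred and gold are equal-length, non-empty lists of labels.
    """
    by_cluster = defaultdict(Counter)
    for p, g in zip(pred, gold):
        by_cluster[p][g] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(pred)


def harmonic_mean_purity(pred, gold):
    """Harmonic mean of purity and inverse purity (assumed reading of P1)."""
    p = purity(pred, gold)
    inv = purity(gold, pred)  # inverse purity: swap the roles of the two labelings
    return 2 * p * inv / (p + inv) if (p + inv) else 0.0


# Toy example: six documents, three predicted topics, three gold categories.
toy_pred = ["A", "A", "A", "A", "B", "C"]
toy_gold = ["x", "x", "y", "y", "y", "z"]
print(purity(toy_pred, toy_gold))
print(purity(toy_gold, toy_pred))
print(harmonic_mean_purity(toy_pred, toy_gold))
```

Purity alone rewards producing many small clusters and inverse purity alone rewards producing one large cluster, which is why the two are combined. ARI and NMI are available in standard libraries (for example scikit-learn's `adjusted_rand_score` and `normalized_mutual_info_score`), so they are not re-implemented here.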

Figure 2: The number of topics generated over documents processed in the Bills and Wiki corpus. The grey line indicates the number of expected topics, simulated using the empirical distribution of ground-truth topics for the datasets. For both datasets, we see a similar pattern - after a "topic drou


This review was generated by AI and checked by a human editor.