QUICK REVIEW

[Paper Review] TELEClass: Taxonomy Enrichment and LLM-Enhanced Hierarchical Text Classification with Minimal Supervision

Yunyi Zhang, Ruozhen Yang|arXiv (Cornell University)|Feb 29, 2024
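Text and Document Classification Technologies|Cited by 7
One-Sentence Summary

TELEClass enriches the label taxonomy with corpus-driven topical terms and uses LLMs for core class annotation and path-based data augmentation, enabling effective weakly-supervised hierarchical text classification. It outperforms previous weakly-supervised and zero-shot methods on two public datasets.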

ABSTRACT

Hierarchical text classification aims to categorize each document into a set of classes in a label taxonomy, which is a fundamental web text mining task with broad applications such as web content analysis and semantic indexing. Most earlier works focus on fully or semi-supervised methods that require a large amount of human annotated data which is costly and time-consuming to acquire. To alleviate human efforts, in this paper, we work on hierarchical text classification with a minimal amount of supervision: using the sole class name of each node as the only supervision. Recently, large language models (LLM) have shown competitive performance on various tasks through zero-shot prompting, but this method performs poorly in the hierarchical setting because it is ineffective to include the large and structured label space in a prompt. On the other hand, previous weakly-supervised hierarchical text classification methods only utilize the raw taxonomy skeleton and ignore the rich information hidden in the text corpus that can serve as additional class-indicative features. To tackle the above challenges, we propose TELEClass, Taxonomy Enrichment and LLM-Enhanced weakly-supervised hierarchical text Classification, which combines the general knowledge of LLMs and task-specific features mined from an unlabeled corpus. TELEClass automatically enriches the raw taxonomy with class-indicative features for better label space understanding and utilizes novel LLM-based data annotation and generation methods specifically tailored for the hierarchical setting. Experiments show that TELEClass can significantly outperform previous baselines while achieving comparable performance to zero-shot prompting of LLMs with drastically less inference cost.

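Motivation and Objectives

  • Motivate hierarchical text classification under minimal supervision, where the class name of each node is the only supervision signal.
  • Enrich the taxonomy with corpus-derived topical terms to improve pseudo-label quality.
  • Use large language models for taxonomy-aware annotation and path-based data augmentation.
  • Train a multi-label classifier on core classes and generated pseudo data so that it covers the full taxonomy.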

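Proposed Method

  • LLM-enhanced core class annotation: identify each document's core classes through a top-down candidate search followed by LLM selection.
  • Corpus-based taxonomy enrichment: mine class-indicative topical terms from the corpus and attach them to the taxonomy.
  • Core class refinement: use embedding-based document-class matching under the enriched taxonomy, which makes scores comparable across documents.
  • Path-based data augmentation: for each root-to-leaf path, generate pseudo documents with an LLM so that the entire taxonomy is covered.
  • Classifier training: train a multi-label text classifier with a log-bilinear matching network on the core labels and the generated pseudo-labeled documents.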
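The top-down candidate search in the first step can be pictured with a minimal sketch. Everything here is an assumption for illustration: the toy taxonomy, the word-match `similarity` scorer (a stand-in for a real text-similarity model), and the `keep_k` parameter are hypothetical, and the LLM selection that follows the search is not shown.

```python
from typing import Dict, List

# Toy taxonomy (parent -> children). Class names are illustrative only.
TAXONOMY: Dict[str, List[str]] = {
    "root": ["grocery", "toys"],
    "grocery": ["coffee", "tea"],
    "toys": ["puzzles", "dolls"],
}


def similarity(doc: str, class_name: str) -> float:
    """Placeholder scorer (assumption): 1.0 if the class name occurs as a word."""
    return 1.0 if class_name in doc.lower().split() else 0.0


def candidate_classes(
    doc: str,
    taxonomy: Dict[str, List[str]],
    keep_k: int = 1,
    root: str = "root",
) -> List[str]:
    """Walk the taxonomy level by level, keeping the keep_k best children per node."""
    candidates: List[str] = []
    frontier = [root]
    while frontier:
        next_frontier: List[str] = []
        for node in frontier:
            children = taxonomy.get(node, [])
            ranked = sorted(children, key=lambda c: similarity(doc, c), reverse=True)
            next_frontier.extend(ranked[:keep_k])
        candidates.extend(next_frontier)
        frontier = next_frontier
    return candidates


if __name__ == "__main__":
    doc = "I bought this coffee at the grocery store"
    print(candidate_classes(doc, TAXONOMY))
```

Pruning at each level keeps the candidate set small, so a later LLM prompt would only need to list a handful of classes rather than the whole label space.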
Figure 1 . An example document tagged with 3 classes. We automatically enrich each node with class-indicative terms and utilize LLMs to facilitate classification.
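As a rough illustration of what enriching a node with class-indicative terms could look like, the sketch below ranks terms for each sibling class by how often they occur in that class's documents and how exclusive they are to it. The scoring formula, the `enrich` function, and the toy documents are assumptions made for this example; they are a simplified stand-in and not the paper's term-scoring method.

```python
import math
from collections import Counter
from typing import Dict, List


def doc_freq(docs: List[str]) -> Counter:
    """Count, for each term, how many documents contain it."""
    df: Counter = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def enrich(class_docs: Dict[str, List[str]], top_k: int = 2) -> Dict[str, List[str]]:
    """Pick top_k terms per class using popularity * distinctiveness (assumed score)."""
    dfs = {c: doc_freq(docs) for c, docs in class_docs.items()}
    total: Counter = Counter()
    for df in dfs.values():
        total.update(df)
    enriched: Dict[str, List[str]] = {}
    for c, df in dfs.items():
        # popularity: log of in-class document frequency
        # distinctiveness: share of the term's occurrences that fall in this class
        scores = {t: math.log(1 + n) * (n / total[t]) for t, n in df.items()}
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        enriched[c] = ranked[:top_k]
    return enriched


# Hypothetical documents roughly assigned to two sibling classes.
SIBLINGS = {
    "coffee": ["dark roast espresso beans", "espresso beans smell great", "great price"],
    "tea": ["green tea leaves", "loose leaves taste great", "great price"],
}

if __name__ == "__main__":
    print(enrich(SIBLINGS))
```

Terms shared by both siblings, such as "great" or "price", score low because they do not help separate one class from the other.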

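Experimental Results

Research Questions

  • RQ1: Can hierarchical text classification be learned effectively under minimal supervision that uses only class names?
  • RQ2: Does corpus-based taxonomy enrichment improve pseudo-label quality and final performance in the weakly-supervised setting?
  • RQ3: How can LLMs be integrated to improve core class annotation and to generate taxonomy-aware pseudo documents?
  • RQ4: How does path-based data augmentation affect coverage and accuracy on large-scale taxonomies?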

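Key Findings

  • TELEClass achieves the best performance among zero-shot and weakly-supervised baselines on Amazon-531 and DBPedia-298.
  • Taxonomy enrichment and path-based data generation bring complementary gains: enrichment helps distinguish lower-level classes, while generation improves coverage, especially on Amazon-531.
  • The ablation study compares the Gen-Only, NoEnrich, and NoGen variants. The full TELEClass yields the strongest results, and the relative contributions of enrichment and generation vary by dataset.
  • Compared with GPT-3.5-turbo prompting, TELEClass, with its taxonomy guidance and augmentation, achieves better hierarchical classification accuracy.
  • Fully supervised training remains the strongest overall, but TELEClass substantially narrows the gap under minimal supervision.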
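To make the coverage point concrete, the sketch below enumerates every root-to-leaf path in a toy taxonomy and reports which paths end in a leaf that no pseudo-labeled document reaches. Those are the paths for which generated documents would fill the gap. The taxonomy, the pseudo-label sets, and the prompt wording in `generation_prompt` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set

# Toy taxonomy (parent -> children). Class names are illustrative only.
TAXONOMY: Dict[str, List[str]] = {
    "root": ["grocery", "toys"],
    "grocery": ["coffee", "tea"],
    "toys": ["puzzles", "dolls"],
}


def root_to_leaf_paths(taxonomy: Dict[str, List[str]], root: str = "root") -> List[List[str]]:
    """Enumerate all root-to-leaf paths (depth-first), dropping the artificial root."""
    paths: List[List[str]] = []
    stack = [[root]]
    while stack:
        path = stack.pop()
        children = taxonomy.get(path[-1], [])
        if not children:
            paths.append(path[1:])
        for child in reversed(children):
            stack.append(path + [child])
    return paths


def uncovered_paths(paths: List[List[str]], pseudo_labels: List[Set[str]]) -> List[List[str]]:
    """Return paths whose leaf class appears in no document's pseudo-label set."""
    labeled = set().union(*pseudo_labels)
    return [p for p in paths if p[-1] not in labeled]


def generation_prompt(path: List[str]) -> str:
    """Hypothetical prompt wording for an LLM; an assumption for illustration."""
    return "Write a short product review that belongs to: " + " > ".join(path)


if __name__ == "__main__":
    paths = root_to_leaf_paths(TAXONOMY)
    pseudo = [{"grocery", "coffee"}, {"toys", "puzzles"}]
    for p in uncovered_paths(paths, pseudo):
        print(generation_prompt(p))
```

Because the paths are enumerated from the taxonomy itself rather than from the corpus, every leaf gets at least one prompt regardless of how skewed the pseudo labels are.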
Figure 2 . Overview of the TELEClass framework.

This review was generated by AI and reviewed by human editors.