QUICK REVIEW

[Paper Review] TELEClass: Taxonomy Enrichment and LLM-Enhanced Hierarchical Text Classification with Minimal Supervision

Yunyi Zhang, Ruozhen Yang | arXiv (Cornell University) | 2024-02-29
Text and Document Classification Technologies | Citations: 7
One-Line Summary

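TELEClass enriches the label taxonomy with corpus-derived topical terms, uses LLMs to annotate each document's core classes, and applies path-based data augmentation, which makes effective hierarchical text classification possible under weak supervision. On two public datasets it outperforms previous weakly-supervised and zero-shot methods.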

ABSTRACT

Hierarchical text classification aims to categorize each document into a set of classes in a label taxonomy, which is a fundamental web text mining task with broad applications such as web content analysis and semantic indexing. Most earlier works focus on fully or semi-supervised methods that require a large amount of human annotated data which is costly and time-consuming to acquire. To alleviate human efforts, in this paper, we work on hierarchical text classification with a minimal amount of supervision: using the sole class name of each node as the only supervision. Recently, large language models (LLM) have shown competitive performance on various tasks through zero-shot prompting, but this method performs poorly in the hierarchical setting because it is ineffective to include the large and structured label space in a prompt. On the other hand, previous weakly-supervised hierarchical text classification methods only utilize the raw taxonomy skeleton and ignore the rich information hidden in the text corpus that can serve as additional class-indicative features. To tackle the above challenges, we propose TELEClass, Taxonomy Enrichment and LLM-Enhanced weakly-supervised hierarchical text Classification, which combines the general knowledge of LLMs and task-specific features mined from an unlabeled corpus. TELEClass automatically enriches the raw taxonomy with class-indicative features for better label space understanding and utilizes novel LLM-based data annotation and generation methods specifically tailored for the hierarchical setting. Experiments show that TELEClass can significantly outperform previous baselines while achieving comparable performance to zero-shot prompting of LLMs with drastically less inference cost.

Research Motivation and Goals

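  • Pursue hierarchical text classification under minimal supervision, using only the class name of each node as supervision.
  • Enrich the taxonomy with corpus-derived topical terms to improve pseudo-label quality.
  • Use large language models for taxonomy-guided annotation and path-based data augmentation.
  • Train a multi-label classifier that covers the entire taxonomy, using core classes and generated pseudo-data.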

Proposed Method

  • LLM-enhanced core class annotation to identify document core classes via top-down candidate search and LLM selection.
  • Corpus-based taxonomy enrichment to mine class-indicative topical terms from the corpus and augment the taxonomy.
  • Core class refinement with enriched taxonomy using embedding-based document-class matching for cross-document comparability.
  • Path-based data augmentation with LLM-generated pseudo-documents for every root-to-leaf path to ensure taxonomy-wide coverage.
  • Train a multi-label text classifier with a log-bilinear matching network using core and generated pseudo-labels.
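
Two of the steps above, the top-down candidate search and the enumeration of root-to-leaf paths for pseudo-document generation, can be illustrated with a small sketch. This is not the authors' implementation: the dictionary-based taxonomy, the function names, the `keep_per_level` parameter, and the toy similarity scores are all assumptions made for illustration, and the taxonomy is assumed to be a tree. In the method described above, the score would come from a document-class similarity model, and an LLM would then select core classes from the returned candidates.

```python
from typing import Callable, Dict, List

# Hypothetical representation: parent class -> list of child classes (assumed tree).
Taxonomy = Dict[str, List[str]]


def root_to_leaf_paths(taxonomy: Taxonomy, root: str) -> List[List[str]]:
    """Enumerate every root-to-leaf path, excluding the artificial root itself."""
    paths: List[List[str]] = []

    def walk(node: str, prefix: List[str]) -> None:
        children = taxonomy.get(node, [])
        if not children:  # a leaf closes the current path
            paths.append(prefix)
            return
        for child in children:
            walk(child, prefix + [child])

    walk(root, [])
    return paths


def top_down_candidates(
    taxonomy: Taxonomy,
    root: str,
    score: Callable[[str], float],
    keep_per_level: int = 2,
) -> List[str]:
    """Level-by-level search: keep the best-scoring children of the nodes kept so far."""
    candidates: List[str] = []
    frontier = [root]
    while frontier:
        children = [c for node in frontier for c in taxonomy.get(node, [])]
        kept = sorted(children, key=score, reverse=True)[:keep_per_level]
        candidates.extend(kept)
        frontier = kept  # only the kept nodes are expanded at the next level
    return candidates


# Toy taxonomy and made-up similarity scores for a single document.
taxonomy = {
    "root": ["hair care", "toys"],
    "hair care": ["shampoo", "conditioner"],
    "toys": ["puzzles"],
}
toy_scores = {
    "hair care": 0.9, "toys": 0.1,
    "shampoo": 0.8, "conditioner": 0.3, "puzzles": 0.2,
}

paths = root_to_leaf_paths(taxonomy, "root")
candidates = top_down_candidates(taxonomy, "root", toy_scores.get, keep_per_level=1)
print(paths)
print(candidates)
```

Each path in `paths` would correspond to one generation request, so that every class in the taxonomy receives at least some pseudo-labeled text, while `candidates` is the narrowed label set that would be handed to the LLM in place of the full label space.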
Figure 1. An example document tagged with 3 classes. We automatically enrich each node with class-indicative terms and utilize LLMs to facilitate classification.

Experimental Results

Research Questions

  • RQ1: Can hierarchical text classification be effectively learned with minimal supervision using only class names?
  • RQ2: Does corpus-based taxonomy enrichment improve pseudo-label quality and final performance in weakly-supervised settings?
  • RQ3: How can LLMs be integrated to enhance core-class annotation and generate taxonomy-aware pseudo-documents?
  • RQ4: What is the impact of path-based data augmentation on coverage and accuracy across a large taxonomy?

Key Results

  • TELEClass achieves the best performance among zero-shot and weakly-supervised baselines on Amazon-531 and DBPedia-298.
  • Taxonomy enrichment and path-based data generation contribute complementary gains, with enrichment aiding lower-level distinctions and generation improving coverage, especially on Amazon-531.
  • Ablation studies compare the Gen-Only, NoEnrich, and NoGen variants against the full model; TELEClass with all components yields the strongest results, and the relative contribution of enrichment vs. generation varies by dataset.
  • Compared to GPT-3.5-turbo prompting, carefully designed TELEClass with taxonomy guidance and augmentation yields superior hierarchical classification accuracy.
  • Fully supervised training remains strongest overall, but TELEClass narrows the gap significantly under minimal supervision.
Figure 2. Overview of the TELEClass framework.


This review was generated by AI and reviewed by a human editor.