QUICK REVIEW

[Paper Review] TELEClass: Taxonomy Enrichment and LLM-Enhanced Hierarchical Text Classification with Minimal Supervision

Yunyi Zhang, Ruozhen Yang | arXiv (Cornell University) | 2024-02-29

Text and Document Classification Technologies | Citations: 7

One-Line Summary

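TELEClass enriches the label taxonomy with topical terms mined from the corpus, uses LLMs to annotate each document's core classes, and applies path-based data augmentation, which makes effective hierarchical text classification possible under weak supervision. It outperforms previous weakly-supervised and zero-shot methods on two public datasets.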

ABSTRACT

Hierarchical text classification aims to categorize each document into a set of classes in a label taxonomy, which is a fundamental web text mining task with broad applications such as web content analysis and semantic indexing. Most earlier works focus on fully or semi-supervised methods that require a large amount of human annotated data which is costly and time-consuming to acquire. To alleviate human efforts, in this paper, we work on hierarchical text classification with a minimal amount of supervision: using the sole class name of each node as the only supervision. Recently, large language models (LLM) have shown competitive performance on various tasks through zero-shot prompting, but this method performs poorly in the hierarchical setting because it is ineffective to include the large and structured label space in a prompt. On the other hand, previous weakly-supervised hierarchical text classification methods only utilize the raw taxonomy skeleton and ignore the rich information hidden in the text corpus that can serve as additional class-indicative features. To tackle the above challenges, we propose TELEClass, Taxonomy Enrichment and LLM-Enhanced weakly-supervised hierarchical text Classification, which combines the general knowledge of LLMs and task-specific features mined from an unlabeled corpus. TELEClass automatically enriches the raw taxonomy with class-indicative features for better label space understanding and utilizes novel LLM-based data annotation and generation methods specifically tailored for the hierarchical setting. Experiments show that TELEClass can significantly outperform previous baselines while achieving comparable performance to zero-shot prompting of LLMs with drastically less inference cost.

Motivation and Goals

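Pursue hierarchical text classification under minimal supervision, using only the class name of each node as supervision.
Enrich the taxonomy with topical terms derived from the corpus to improve pseudo-label quality.
Use large language models for taxonomy-guided annotation and path-based data augmentation.
Train a multi-label classifier that covers the entire taxonomy, using the core classes and the generated pseudo-data.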

Proposed Method

LLM-enhanced core class annotation to identify document core classes via top-down candidate search and LLM selection.
Corpus-based taxonomy enrichment to mine class-indicative topical terms from the corpus and augment the taxonomy.
Core class refinement with enriched taxonomy using embedding-based document-class matching for cross-document comparability.
Path-based data augmentation with LLM-generated pseudo-documents for every root-to-leaf path to ensure taxonomy-wide coverage.
Multi-label text classifier training with a log-bilinear matching network, using the annotated core classes and the LLM-generated pseudo-documents as training signal.
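
The first step above, top-down candidate search, can be illustrated with a small sketch. It walks a toy taxonomy level by level and keeps only the best-scoring children of the surviving nodes, so that an LLM would only have to choose among a short candidate list rather than the whole label space. The taxonomy, the `keyword_overlap` scorer, and the per-level width `k` are all assumptions made for illustration; this review does not specify the similarity measure or the search width that TELEClass uses.

```python
from typing import Callable, Dict, List, Set

# Hypothetical toy taxonomy (parent -> children); not taken from the paper.
TAXONOMY: Dict[str, List[str]] = {
    "root": ["electronics", "grocery"],
    "electronics": ["cameras", "headphones"],
    "grocery": ["snacks", "beverages"],
}

# Hypothetical class keywords standing in for a real similarity model.
KEYWORDS: Dict[str, Set[str]] = {
    "electronics": {"wireless", "audio", "battery"},
    "grocery": {"food", "drink"},
    "cameras": {"lens", "photo"},
    "headphones": {"headphones", "noise", "audio"},
    "snacks": {"chips", "cookies"},
    "beverages": {"coffee", "tea"},
}


def keyword_overlap(doc: str, cls: str) -> float:
    """Toy document-class similarity: number of class keywords in the document."""
    return float(len(set(doc.lower().split()) & KEYWORDS[cls]))


def top_down_candidates(
    doc: str,
    taxonomy: Dict[str, List[str]],
    score: Callable[[str, str], float],
    root: str = "root",
    k: int = 1,
) -> List[str]:
    """Level-wise search that keeps the k best-scoring classes at each level."""
    candidates: List[str] = []
    frontier = [root]
    while frontier:
        # Gather the children of every surviving node, without duplicates.
        children = list(
            dict.fromkeys(c for node in frontier for c in taxonomy.get(node, []))
        )
        if not children:
            break
        ranked = sorted(children, key=lambda c: score(doc, c), reverse=True)
        frontier = ranked[:k]  # illustrative width, an assumption
        candidates.extend(frontier)
    return candidates


doc = "wireless headphones with noise cancelling audio"
candidates = top_down_candidates(doc, TAXONOMY, keyword_overlap, k=1)
print(candidates)
```

In a fuller pipeline the returned candidates would be placed in an LLM prompt that asks which of them are the document's core classes; that prompt is omitted here because its wording is not given in this review.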

Figure 1. An example document tagged with 3 classes. We automatically enrich each node with class-indicative terms and utilize LLMs to facilitate classification.

Experimental Results

Research Questions

RQ1: Can hierarchical text classification be effectively learned with minimal supervision using only class names?
RQ2: Does corpus-based taxonomy enrichment improve pseudo-label quality and final performance in weakly-supervised settings?
RQ3: How can LLMs be integrated to enhance core-class annotation and generate taxonomy-aware pseudo-documents?
RQ4: What is the impact of path-based data augmentation on coverage and accuracy across a large taxonomy?

Key Results

TELEClass achieves the best performance among zero-shot and weakly-supervised baselines on Amazon-531 and DBPedia-298.
Taxonomy enrichment and path-based data generation contribute complementary gains, with enrichment aiding lower-level distinctions and generation improving coverage, especially on Amazon-531.
Ablation studies compare the Gen-Only, NoEnrich, and NoGen variants against the full model: TELEClass with all components yields the strongest results, and the relative contribution of enrichment versus generation varies by dataset.
TELEClass, with taxonomy guidance and data augmentation, achieves higher hierarchical classification accuracy than zero-shot prompting of GPT-3.5-turbo.
Fully supervised training remains strongest overall, but TELEClass narrows the gap significantly under minimal supervision.
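
The coverage role of path-based generation noted above depends on visiting every root-to-leaf path of the taxonomy. A minimal sketch of that enumeration follows, assuming the taxonomy is stored as a parent-to-children dictionary. The toy classes and the wording produced by `generation_prompt` are hypothetical and are not the prompt used in the paper.

```python
from typing import Dict, List

# Hypothetical toy taxonomy (parent -> children); not taken from the paper.
TOY_TAXONOMY: Dict[str, List[str]] = {
    "root": ["electronics", "grocery"],
    "electronics": ["cameras", "headphones"],
    "grocery": ["snacks"],
}


def root_to_leaf_paths(taxonomy: Dict[str, List[str]], root: str) -> List[List[str]]:
    """Enumerate every root-to-leaf path with an iterative depth-first search.

    Assumes the taxonomy is acyclic. A class with several parents simply
    appears in several paths.
    """
    paths: List[List[str]] = []
    stack: List[List[str]] = [[root]]
    while stack:
        path = stack.pop()
        children = taxonomy.get(path[-1], [])
        if not children:
            paths.append(path)  # reached a leaf
            continue
        # Push in reverse so children are visited in their listed order.
        for child in reversed(children):
            stack.append(path + [child])
    return paths


def generation_prompt(path: List[str]) -> str:
    """Hypothetical prompt wording; the artificial root node is dropped."""
    chain = " > ".join(path[1:])
    return f"Write a short document that belongs to the category path: {chain}."


paths = root_to_leaf_paths(TOY_TAXONOMY, "root")
prompts = [generation_prompt(p) for p in paths]

# Every class is covered by at least one path, which is the point of the step.
all_classes = set(TOY_TAXONOMY) | {c for cs in TOY_TAXONOMY.values() for c in cs}
covered = {c for p in paths for c in p}
print(paths)
print(covered == all_classes)
```

Because each leaf lies on at least one path, generating a pseudo-document per path gives every class in the toy taxonomy, including rarely matched leaves, at least one training example.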

Figure 2. Overview of the TELEClass framework.

This review was generated by AI and checked by a human editor.