QUICK REVIEW

[논문 리뷰] UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition

Wenxuan Zhou, Sheng Zhang|arXiv (Cornell University)|2023. 08. 07.

Topic Modeling인용 수 25

한 줄 요약

유니버설NER는 Mission-focused 지시 조정을 통해 ChatGPT와 유사한 NER 능력을 더 작은 모델로 증류하여, 직접 감독 없이도 대규모 UniNER 벤치마크에서 오픈 도메인 NER의 최첨단 성능을 달성합니다.

ABSTRACT

Large language models (LLMs) have demonstrated remarkable generalizability, such as understanding arbitrary entities and relations. Instruction tuning has proven effective for distilling LLMs into more cost-efficient models such as Alpaca and Vicuna. Yet such student models still trail the original LLMs by large margins in downstream applications. In this paper, we explore targeted distillation with mission-focused instruction tuning to train student models that can excel in a broad application class such as open information extraction. Using named entity recognition (NER) for case study, we show how ChatGPT can be distilled into much smaller UniversalNER models for open NER. For evaluation, we assemble the largest NER benchmark to date, comprising 43 datasets across 9 diverse domains such as biomedicine, programming, social media, law, finance. Without using any direct supervision, UniversalNER attains remarkable NER accuracy across tens of thousands of entity types, outperforming general instruction-tuned models such as Alpaca and Vicuna by over 30 absolute F1 points in average. With a tiny fraction of parameters, UniversalNER not only acquires ChatGPT's capability in recognizing arbitrary entity types, but also outperforms its NER accuracy by 7-9 absolute F1 points in average. Remarkably, UniversalNER even outperforms by a large margin state-of-the-art multi-task instruction-tuned systems such as InstructUIE, which uses supervised NER examples. We also conduct thorough ablation studies to assess the impact of various components in our distillation approach. We release the distillation recipe, data, and UniversalNER models to facilitate future research on targeted distillation.

연구 동기 및 목표

광범위한 응용 범주(예: 오픈 정보 추출과 NER)에서 대형 언어 모델과 소형 지시형 모델 간 성능 격차를 해소하기 위한 표적 증류의 동기를 부여한다.
라벨이 없는 웹 텍스트로부터 다양한 지시 미세조정 데이터를 생성하여 더 작은 모델이 임의의 엔티티 유형을 인식하도록 가르치는 방법을 조사한다.
증류 방법의 도메인 간, 유형 간 일반화를 평가하기 위한 포괄적인 Universal NER 벤치마크를 구성한다.]
method2-1
method2-2
method2-3
method2-4
method2-5
method2-6
research_questions: [
Can targeted distillation from an LLM, guided by mission-focused instruction tuning, replicate or surpass the LLM's open-domain NER capabilities across diverse entity types and domains?
How do data construction choices (input sampling, negative sampling, and template design) affect zero-shot NER performance of distilled models?
What is the impact of domain coverage, dataset-specific label harmonization, and partial-match evaluation on the effectiveness of UniversalNER?
How does UniNER compare to strong instruction-tuned and supervised systems (e.g., ChatGPT, Vicuna, InstructUIE) in zero-shot and supervised settings?
Does supervised finetuning with human annotations further improve cross-domain generalization for open-domain NER?]
key_findings: ["Distilled UniNER models (7B, 13B) outperform ChatGPT on average across the UniNER benchmark in zero-shot NER. ","UniNER-13B achieves higher average F1 than UniNER-7B, indicating benefit from larger distilled capacity.","UniNER outperforms Vicuna and InstructUIE on average in zero-shot and supervised settings across multiple domains.","Negative sampling with frequency-based selection is crucial for improving performance in instruction tuning.","Dataset-specific templates generally improve performance, especially for labels with overlaps across datasets.","In supervised in-domain evaluation, UniNER-7B achieves 84.78% average F1 on 20 datasets, surpassing BERT-base and InstructUIE-11B; continual supervised fine-tuning reaches 60.0% average F1 on out-of-domain evaluation."]
table_headers: []
table_rows: []

제안 방법

Pile 코퍼스에서 샘플링한 구문에 대해 ChatGPT를 사용하여 NER 주석을 생성하고, 다양하고 라벨이 없는 감독 신호를 만든다.
대화형 템플릿을 사용하여 작은 모델(LLaMA-2 계열)에 미션 중심의 지시 미세조정을 적용하고 구문에서 유형별로 엔티티를 추출한다(쿼리당 한 유형 또는 한 쿼리에서 모든 유형).
구문에 나타나지 않는 엔티티 유형을 포함하여 오픈 월드 조건을 모사하는 음성 샘플링을 포함한다.
서로 다른 NER 데이터셋 간의 레이블 의미를 조화시키고 충돌을 줄이기 위해 데이터셋별 지시 템플릿을 사용하고, 필요 시 동의어를 늘리기 위해 정의를 보강한다.
선택적으로 사람 주석 데이터로 지도학습 미세조정을 수행하여 도메인 내외 성능을 향상시키고, 제로샷과 감독 하의 설정을 별도로 평가한다.
생성 UniversalNER 벤치마크(9개 도메인에 걸친 43개 데이터셋)에서 구성하고 평가한다 (예: 생물의학, 프로그래밍, 소셜 미디어, 법, 금융).

실험 결과

연구 질문

RQ1미션 중심의 지시 미세조정에 따라 LLM의 표적 증류가 다양한 엔티티 유형과 도메인에 걸쳐 LLM의 오픈 도메인 NER 능력을 재현하거나 능가할 수 있는가?
RQ2입력 샘플링, 음수 샘플링, 템플릿 설계와 같은 데이터 구성 선택이 증류 모델의 제로샷 NER 성능에 어떤 영향을 미치는가?
RQ3도메인 커버리지, 데이터셋별 라벨 조화, 부분 매치 평가가 UniversalNER의 효과성에 어떤 영향을 미치는가?
RQ4UniNER가 제로샷 및 감독 설정에서 강력한 지시 조정 및 감독 시스템(ChatGPT, Vicuna, InstructUIE 등)과 어떻게 비교되는가?
RQ5사람 주석이 포함된 감독 미세조정이 오픈 도메인 NER의 교차 도메인 일반화를 더 향상시키는가?]
RQ6key_findings:["Distilled UniNER models (7B, 13B) outperform ChatGPT on average across the UniNER benchmark in zero-shot NER. ","UniNER-13B achieves higher average F1 than UniNER-7B, indicating benefit from larger distilled capacity.","UniNER outperforms Vicuna and InstructUIE on average in zero-shot and supervised settings across multiple domains.","Negative sampling with frequency-based selection is crucial for improving performance in instruction tuning.","Dataset-specific templates generally improve performance, especially for labels with overlaps across datasets.","In supervised in-domain evaluation, UniNER-7B achieves 84.78% average F1 on 20 datasets, surpassing BERT-base and InstructUIE-11B; continual supervised fine-tuning reaches 60.0% average F1 on out-of-domain evaluation."]
RQ7table_headers: []
RQ8table_rows: []

주요 결과

Distilled UniNER models (7B, 13B) outperform ChatGPT on average across the UniNER benchmark in zero-shot NER.
UniNER-13B achieves higher average F1 than UniNER-7B, indicating benefit from larger distilled capacity.
UniNER outperforms Vicuna and InstructUIE on average in zero-shot and supervised settings across multiple domains.
Negative sampling with frequency-based selection is crucial for improving performance in instruction tuning.
Dataset-specific templates generally improve performance, especially for labels with overlaps across datasets.
In supervised in-domain evaluation, UniNER-7B achieves 84.78% average F1 on 20 datasets, surpassing BERT-base and InstructUIE-11B; continual supervised fine-tuning reaches 60.0% average F1 on out-of-domain evaluation.

더 나은 연구,지금 바로 시작하세요

연구 설계부터 논문 작성까지, 연구 시간을 획기적으로 줄여보세요.

카드 등록 없음 · 무료 플랜 제공

이 리뷰는 AI가 만들고, 인간 에디터가 검토했습니다.