QUICK REVIEW

[论文解读] UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition

Wenxuan Zhou, Sheng Zhang|arXiv (Cornell University)|Aug 7, 2023

Topic Modeling被引用 25

一句话总结

UniversalNER 将 ChatGPT 风格的 NER 能力通过以任务为中心的指令微调蒸馏到更小的模型，在一个庞大的 UniNER 基准上实现开域 NER 的最先进水平，无需直接监督。

ABSTRACT

Large language models (LLMs) have demonstrated remarkable generalizability, such as understanding arbitrary entities and relations. Instruction tuning has proven effective for distilling LLMs into more cost-efficient models such as Alpaca and Vicuna. Yet such student models still trail the original LLMs by large margins in downstream applications. In this paper, we explore targeted distillation with mission-focused instruction tuning to train student models that can excel in a broad application class such as open information extraction. Using named entity recognition (NER) for case study, we show how ChatGPT can be distilled into much smaller UniversalNER models for open NER. For evaluation, we assemble the largest NER benchmark to date, comprising 43 datasets across 9 diverse domains such as biomedicine, programming, social media, law, finance. Without using any direct supervision, UniversalNER attains remarkable NER accuracy across tens of thousands of entity types, outperforming general instruction-tuned models such as Alpaca and Vicuna by over 30 absolute F1 points in average. With a tiny fraction of parameters, UniversalNER not only acquires ChatGPT's capability in recognizing arbitrary entity types, but also outperforms its NER accuracy by 7-9 absolute F1 points in average. Remarkably, UniversalNER even outperforms by a large margin state-of-the-art multi-task instruction-tuned systems such as InstructUIE, which uses supervised NER examples. We also conduct thorough ablation studies to assess the impact of various components in our distillation approach. We release the distillation recipe, data, and UniversalNER models to facilitate future research on targeted distillation.

研究动机与目标

推动有针对性的蒸馏，以缩小大型语言模型与小型指令模型在广泛应用场景（以开放信息抽取和命名实体识别为例）之间的性能差距。
研究如何从未标注的网页文本中生成多样化的指令微调数据，以训练一个更小的模型识别任意实体类型。
组建一个全面的通用 NER 基准，用以评估蒸馏方法在跨领域、跨类型上的泛化能力。

提出的方法

使用 ChatGPT 为来自 Pile 语料库的段落生成 NER 注释，创建一个多样化的未标注监督信号。
在较小模型（LLaMA-2 系列）上应用以任务为焦点的指令微调，使用对话风格模板从段落中按类型提取实体（每次查询一个类型或在一个查询中提取所有类型）。
通过包含段落中不存在的实体类型来实现负采样，以模拟开放世界条件。
使用数据集特定的指令模板以协调不同 NER 数据集之间的标签语义并减少冲突；可选地增加定义以提高对改写的鲁棒性。
可选地使用人工标注数据进行有监督微调，以提升域内和域外的性能；分别评估零-shot 和有监督两种模式。
在包含 9 个领域（如生物医学、编程、社交媒体、法律、金融等）的 43 数据集组成的 UniversalNER 基准上进行构建与评估。

实验结果

研究问题

RQ1通过以任务为焦点的指令微调引导的对 LLM 的定向蒸馏，是否能够在多样的实体类型和领域中复制或超过 LLM 的开域 NER 能力？
RQ2数据构造选择（输入抽样、负采样和模板设计）如何影响蒸馏模型的零-shot NER 表现？
RQ3域覆盖、数据集特定标签协调以及部分匹配评估对 UniversalNER 的有效性有何影响？
RQ4在零-shot 和有监督设置下，UniNER 与强指令微调和有监督系统（如 ChatGPT、Vicuna、InstructUIE）相比如何？
RQ5带有人类标注的有监督微调是否进一步改善开域 NER 的跨域泛化？

主要发现

蒸馏后的 UniNER 模型（7B、13B）在 UniNER 基准中的零-shot NER 平均水平上整体上优于 ChatGPT。
UniNER-13B 实现的平均 F1 高于 UniNER-7B，表明更大蒸馏容量的收益。
UniNER 在跨多个领域的零-shot和有监督设置下，平均水平超过 Vicuna 和 InstructUIE。
基于频率的负采样选择对于提升指令微调中的性能至关重要。
数据集特定的模板通常会提升性能，特别是对在多个数据集中存在重叠的标签。
在有监督的域内评估中，UniNER-7B 在 20 个数据集上达到 84.78% 的平均 F1，超过 BERT-base 和 InstructUIE-11B；持续的有监督微调在域外评估中达到 60.0% 的平均 F1。

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。