QUICK REVIEW

[Paper Review] Revisiting Relation Extraction in the era of Large Language Models

Somin Wadhwa, Silvio Amir|arXiv (Cornell University)|May 8, 2023

Topic Modeling|Cited by 12

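One-sentence summary

This paper evaluates GPT-3 and Flan-T5 on end-to-end relation extraction (RE) via generation, showing that few-shot prompting with GPT-3 comes close to SOTA and that Flan-T5 fine-tuned with GPT-3-generated Chain-of-Thought explanations reaches SOTA. It also addresses the evaluation challenges of generative RE.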

ABSTRACT

Relation extraction (RE) is the core NLP task of inferring semantic relationships between entities from text. Standard supervised RE techniques entail training modules to tag tokens comprising entity spans and then predict the relationship between them. Recent work has instead treated the problem as a sequence-to-sequence task, linearizing relations between entities as target strings to be generated conditioned on the input. Here we push the limits of this approach, using larger language models (GPT-3 and Flan-T5 large) than considered in prior work and evaluating their performance on standard RE tasks under varying levels of supervision. We address issues inherent to evaluating generative approaches to RE by doing human evaluations, in lieu of relying on exact matching. Under this refined evaluation, we find that: (1) Few-shot prompting with GPT-3 achieves near SOTA performance, i.e., roughly equivalent to existing fully supervised models; (2) Flan-T5 is not as capable in the few-shot setting, but supervising and fine-tuning it with Chain-of-Thought (CoT) style explanations (generated via GPT-3) yields SOTA results. We release this model as a new baseline for RE tasks.

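Research motivation and goals

Evaluate the ability of very large language models (LLMs) to perform end-to-end relation extraction via generation.
Evaluate few-shot prompting with GPT-3 and compare it against supervised baselines on standard RE datasets.
Investigate the evaluation challenges of generative RE and propose a human-centered evaluation that corrects the bias of strict matching.
Propose a training strategy that brings Flan-T5 to SOTA through fine-tuning with Chain-of-Thought (CoT) explanations.
Provide a practical, open-model RE baseline: Flan-T5 fine-tuned with CoT explanations generated by GPT-3.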

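Proposed method

Formulate RE as conditional text generation: given a context C and an input x, the model outputs linearized relation triples.
Apply in-context learning with GPT-3 (text-davinci-002) and carefully designed prompts on the ADE, CoNLL04, and NYT datasets.
Collect human annotations to judge generated outputs against the gold-standard targets, because evaluation by strict string matching is brittle.
Fine-tune Flan-T5 Large with standard RE supervision.
Additionally supervise Flan-T5 with CoT explanations generated by GPT-3, and compare standard supervision against CoT-augmented supervision.
Report results as micro-F1 scores, and qualitatively analyze how well outputs conform to the target schema.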
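The generative formulation above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual format: the bracketed `[head | relation | tail]` linearization, the `Text:` / `Relations:` prompt template, the function names, and the example sentences are all assumptions made here for clarity.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def linearize(triples: List[Triple]) -> str:
    """Render triples as one target string (assumed bracketed format)."""
    return " ".join(f"[{h} | {r} | {t}]" for h, r, t in triples)


def parse(generated: str) -> List[Triple]:
    """Recover triples from generated text, skipping malformed groups."""
    triples: List[Triple] = []
    for chunk in generated.split("]"):
        if "[" not in chunk:
            continue
        parts = [p.strip() for p in chunk.split("[", 1)[1].split("|")]
        if len(parts) == 3 and all(parts):
            triples.append((parts[0], parts[1], parts[2]))
    return triples


def build_prompt(instruction: str,
                 demos: List[Tuple[str, List[Triple]]],
                 query: str) -> str:
    """Assemble a few-shot prompt: instruction, demonstrations, then the query."""
    blocks = [instruction]
    for text, triples in demos:
        blocks.append(f"Text: {text}\nRelations: {linearize(triples)}")
    blocks.append(f"Text: {query}\nRelations:")
    return "\n\n".join(blocks)


# Made-up demonstration in the style of an adverse-drug-event dataset.
demos = [
    ("The patient developed a rash after taking amoxicillin.",
     [("amoxicillin", "Adverse-Effect", "rash")]),
]
prompt = build_prompt(
    "List the relations expressed in the text.",
    demos,
    "Nausea was reported following treatment with aspirin.",
)
print(prompt)
```

The prompt ends with an open `Relations:` slot for the model to complete, and `parse` turns the completion back into triples so that it can be scored against gold annotations.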

Figure 2: Examples of misclassified FPs and FNs from GPT-3 (generated under few-shot in-context prompting scheme) under traditional evaluation of generative output. In each instance, the entity-type of subject and object was correctly identified.
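To illustrate how strict matching can produce the spurious false positives and false negatives described in Figure 2, the sketch below computes micro-F1 over predicted and gold triples twice: once with exact string matching and once after a simple normalization. The normalizer (lowercasing and collapsing whitespace) is an assumption chosen for illustration only; the paper relies on human evaluation rather than an automatic relaxation, and the example triples are made up.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]


def micro_f1(preds: List[List[Triple]],
             golds: List[List[Triple]],
             normalize: Callable[[Triple], Triple] = lambda t: t) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all sentences, then score."""
    tp = fp = fn = 0
    for pred, gold in zip(preds, golds):
        pred_set = {normalize(t) for t in pred}
        gold_set = {normalize(t) for t in gold}
        tp += len(pred_set & gold_set)
        fp += len(pred_set - gold_set)
        fn += len(gold_set - pred_set)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relaxed(triple: Triple) -> Triple:
    """Hypothetical normalizer: lowercase and collapse whitespace."""
    h, r, t = (" ".join(part.lower().split()) for part in triple)
    return (h, r, t)


golds = [
    [("Boston", "Located_In", "Massachusetts")],
    [("aspirin", "Adverse-Effect", "nausea")],
]
preds = [
    [("boston", "located_in", "Massachusetts")],  # right answer, different casing
    [("aspirin", "Adverse-Effect", "nausea")],
]

print("strict :", micro_f1(preds, golds))
print("relaxed:", micro_f1(preds, golds, relaxed))
```

Under exact matching, the first prediction counts as both a false positive and a false negative even though it is semantically correct, which lowers the score; the normalized comparison credits it.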

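Experimental results

Research questions

RQ1: Can few-shot prompting with GPT-3 achieve near state-of-the-art RE performance on standard datasets?
RQ2: Can Flan-T5 match or exceed supervised RE models in the few-shot setting, and do CoT explanations improve its performance?
RQ3: How should RE outputs from generative models be evaluated when they do not follow an exact output format, and what biases does strict matching introduce?
RQ4: Does training Flan-T5 with GPT-3-generated CoT explanations yield robust, state-of-the-art RE performance across multiple datasets?
RQ5: Is it feasible to build a smaller open-source RE baseline through CoT-guided supervision that matches or exceeds larger models?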

Key findings

Dataset characteristics

Dataset	Entity types	Relation types	Train	Validation	Test
ADE	2	1	4,272	–	–
CoNLL04	4	5	922	231	288
NYT	4	24	56,196	5,000	5,000
DocRED	6	96	3,008	300	700

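Few-shot GPT-3 achieves near-SOTA performance, roughly on par with the best fully supervised models, using only tens of examples.
CoT explanations improve GPT-3's few-shot performance and reduce non-conforming outputs.
Flan-T5 (Large) is weaker than GPT-3 in the few-shot setting, but reaches SOTA when fine-tuned with GPT-3-generated CoT explanations.
Fine-tuning Flan-T5 with CoT explanations yields substantial gains (roughly 5 to 10 micro-F1 points) on the ADE, CoNLL04, and NYT datasets, surpassing previous fully supervised generative approaches.
Supervising Flan-T5 with generated CoT explanations offers a practical, faster-to-train path to SOTA RE that does not require GPT-3 at inference time.
The study highlights the evaluation challenges of generative RE and shows that careful human annotation makes the reported gains more reliable.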

Figure 3: We propose fine-tuning Flan-T5 (large) for relation extraction (RE) using standard supervision and Chain-of-Thought (CoT) reasoning elicited from GPT-3 for RE. This yields SOTA performance across all datasets considered, often by substantial margin (~5 points absolute gain in F1).
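The difference between standard and CoT-augmented supervision can be made concrete with a small sketch of how sequence-to-sequence training pairs might be assembled. Everything here is an assumption for illustration: the `Extract relations:` prefix, the `Explanation:` / `Relations:` target template, the class and function names, and the example explanation (which in the paper's setup would come from GPT-3) are hypothetical, since this review does not give the exact format used.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]


@dataclass
class Seq2SeqExample:
    source: str  # text fed to the encoder
    target: str  # string the decoder is trained to generate


def make_example(text: str,
                 triples: List[Triple],
                 explanation: Optional[str] = None) -> Seq2SeqExample:
    """Build one training pair; add a CoT explanation to the target if given."""
    relations = " ".join(f"[{h} | {r} | {t}]" for h, r, t in triples)
    source = f"Extract relations: {text}"
    if explanation:
        target = f"Explanation: {explanation} Relations: {relations}"
    else:
        target = f"Relations: {relations}"
    return Seq2SeqExample(source, target)


def extract_relations(target: str) -> str:
    """Keep only the relation string, discarding any leading explanation."""
    return target.rsplit("Relations:", 1)[-1].strip()


text = "The patient developed a rash after taking amoxicillin."
triples = [("amoxicillin", "Adverse-Effect", "rash")]

standard = make_example(text, triples)
with_cot = make_example(
    text,
    triples,
    explanation="The rash appeared after amoxicillin was taken, "
                "so the drug is linked to the adverse effect.",
)

print(standard.target)
print(with_cot.target)
```

Because the relation string is always the last segment of the target, the same `extract_relations` step recovers the prediction from either variant, so both can be scored identically.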


This review was generated by AI and checked by a human editor.