QUICK REVIEW

[Paper Review] Structured information extraction from complex scientific text with fine-tuned large language models

Alexander Dunn, John Dagdelen|arXiv (Cornell University)|Dec 10, 2022

Machine Learning in Materials Science | Cited by 65

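ONE-SENTENCE SUMMARY

This paper presents a simple seq2seq approach that fine-tunes GPT-3 on roughly 500 prompt–completion pairs to perform document-level joint NER and relation extraction for complex hierarchical information in scientific text, producing structured JSON from abstracts and passages.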

ABSTRACT

Intelligently extracting and linking complex scientific information from unstructured text is a challenging endeavor particularly for those inexperienced with natural language processing. Here, we present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction for complex hierarchical information in scientific text. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts (inputs) and completions (outputs). Information is extracted either from single sentences or across sentences in abstracts/passages, and the output can be returned as simple English sentences or a more structured format, such as a list of JSON objects. We demonstrate that LLMs trained in this way are capable of accurately extracting useful records of complex scientific knowledge for three representative tasks in materials chemistry: linking dopants with their host materials, cataloging metal-organic frameworks, and general chemistry/phase/morphology/application information extraction. This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text. An online demo is available at http://www.matscholar.com/info-extraction.

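MOTIVATION AND GOALS

Address the need in materials science to extract and link complex scientific information from unstructured text.
Develop a flexible, end-to-end NERRE (joint named entity recognition and relation extraction) method that handles hierarchical and multi-entity relations.
Show that fine-tuning a large language model on structured prompt–completion pairs yields accurate information extraction across multiple tasks.
Show that output can be produced as either natural English or structured JSON, which simplifies database integration.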

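PROPOSED METHOD

Fine-tune GPT-3 on ~100–500 document-completion examples to perform document-level NERRE with a predefined output schema.
Use a human-in-the-loop workflow to grow the training set quickly, with a partially trained model pre-filling the annotations.
Return output as either English sentences or structured (optionally nested) JSON, depending on the task schema.
Evaluate with sequence-reconstruction metrics (exact match, Jaro-Winkler similarity, parsability) and information-extraction metrics (entity triplets under strict word-level matching).
Optionally post-process completions into a hierarchical knowledge graph.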
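
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the `host`/`dopants` schema, the `###` prompt separator, the `END` stop marker, and all helper names (`make_record`, `parse_completion`, `to_triplets`, `triplet_scores`) are hypothetical. Exact triplet matching stands in for the paper's strict word-level matching, and Jaro-Winkler similarity is omitted for brevity.

```python
import json

# Hypothetical dopant schema for illustration: each item links one host
# material to a list of dopants. The paper's real schemas may differ.
def make_record(passage, items, stop=" END"):
    """Build one prompt-completion pair; the completion is a JSON list."""
    return {
        "prompt": passage.strip() + "\n\n###\n\n",   # assumed separator
        "completion": " " + json.dumps(items) + stop,  # assumed stop marker
    }

def parse_completion(text, stop="END"):
    """Parsability check: return the decoded list, or None if invalid."""
    body = text.split(stop, 1)[0].strip()
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, list) else None

def to_triplets(items):
    """Flatten items into (host, relation, dopant) triplets."""
    return {(it["host"], "has_dopant", d) for it in items for d in it["dopants"]}

def triplet_scores(predicted, gold):
    """Precision, recall, and F1 over exactly matching triplets."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Usage: one annotated passage, and a model completion that misses one dopant.
gold_items = [{"host": "ZnO", "dopants": ["Al", "Ga"]}]
record = make_record("Al and Ga co-doped ZnO films were grown.", gold_items)
predicted_items = parse_completion(' [{"host": "ZnO", "dopants": ["Al"]}] END')
scores = triplet_scores(to_triplets(predicted_items), to_triplets(gold_items))
```

Here the prediction recovers one of the two gold triplets, so precision is 1.0 and recall is 0.5. A completion that fails `json.loads` counts as unparsable and yields no triplets at all, which is why parsability is tracked as a separate metric.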

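EXPERIMENTAL RESULTS

Research questions

RQ1: Can a fine-tuned large language model perform joint named entity recognition and relation extraction on complex, hierarchical scientific information?
RQ2: How well does the method generalize across materials science domains (doping, MOFs, general materials) when given task-specific schemas?
RQ3: What practical gain in annotation efficiency does the human-in-the-loop training workflow provide?
RQ4: Which output format (natural language, JSON, or graph structure) best supports downstream use of the extracted information?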

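Key findings

Across three materials science tasks (solid-state doping, MOFs, and general materials information), the method accurately extracts complex knowledge from abstracts and passages.
Fine-tuning GPT-3 on ~100–500 prompt–completion examples yields high-quality structured output as either JSON or English sentences.
Human-in-the-loop annotation cuts annotation time per abstract from about 100 seconds to about 40 seconds.
Compared with seq2rel and MatBERT-based baselines, the LLM-NERRE method captures entities and relations robustly while remaining flexible and schema-driven.
The framework supports downstream decoding into hierarchical graphs and can be used through publicly available APIs for broad accessibility.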
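
Decoding completions into a hierarchical graph could look roughly like the sketch below. It assumes a hypothetical nested JSON schema for the general materials task; the keys `formula`, `applications`, and `structure_or_phase` are chosen here for illustration and are not confirmed by this review. A plain nested dict stands in for a graph library.

```python
from collections import defaultdict

def to_graph(items, root_key="formula"):
    """Merge extracted items into {root: {relation: sorted linked values}}."""
    graph = defaultdict(lambda: defaultdict(set))
    for item in items:
        root = item.get(root_key)
        if not root:
            continue  # skip records that lack a root entity
        for relation, values in item.items():
            if relation == root_key:
                continue
            if isinstance(values, str):
                values = [values]  # accept a bare string as a one-item list
            for value in values:
                if value:
                    graph[root][relation].add(value)
    # Convert to plain dicts with sorted lists for stable output.
    return {
        root: {rel: sorted(vals) for rel, vals in rels.items()}
        for root, rels in graph.items()
    }

# Usage: two items about the same material are merged under one root node.
items = [
    {"formula": "TiO2", "applications": ["photocatalysis"],
     "structure_or_phase": ["anatase"]},
    {"formula": "TiO2", "applications": ["solar cells"]},
]
graph = to_graph(items)
```

Because records that share a root entity are merged, statements about the same material that come from different passages end up under a single node, which is the main reason to prefer a graph over a flat list of JSON objects.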

