QUICK REVIEW

[Paper Review] Information Extraction from Clinical Notes: Are We Ready to Switch to Large Language Models?

Yan Hu, Xu Zuo|arXiv (Cornell University)|Nov 15, 2024

Topic Modeling|Cited by 7

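One-Sentence Summary

This paper compares instruction-tuned LLaMA large language models with BiomedBERT on clinical NER and RE across multi-institutional datasets. The LLMs outperform BERT, most clearly when training data is limited and on unseen data, but they require substantially more computing resources and deliver lower throughput.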

ABSTRACT

Background: Information extraction (IE) is critical in clinical natural language processing (NLP). While large language models (LLMs) excel on generative tasks, their performance on extractive tasks remains debated. Methods: We investigated Named Entity Recognition (NER) and Relation Extraction (RE) using 1,588 clinical notes from four sources (UT Physicians, MTSamples, MIMIC-III, and i2b2). We developed an annotated corpus covering 4 clinical entities and 16 modifiers, and compared instruction-tuned LLaMA-2 and LLaMA-3 against BERT in terms of performance, generalizability, computational resources, and throughput. Results: LLaMA models outperformed BERT across datasets. With sufficient training data, LLaMA showed modest improvements (1% on NER, 1.5-3.7% on RE); improvements were larger with limited training data. On unseen i2b2 data, LLaMA-3-70B outperformed BERT by 7% (F1) on NER and 4% on RE. However, LLaMA models required more computing resources and ran up to 28 times slower. We implemented "Kiwi," a clinical IE package featuring both models, available at https://kiwi.clinicalnlp.org/. Conclusion: This study is among the first to develop and evaluate a comprehensive clinical IE system using open-source LLMs. Results indicate that LLaMA models outperform BERT for clinical NER and RE but with higher computational costs and lower throughput. These findings highlight that choosing between LLMs and traditional deep learning methods for clinical IE applications should remain task-specific, taking into account both performance metrics and practical considerations such as available computing resources and the intended use case scenarios.

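Motivation and Objectives

Evaluate instruction-tuned LLaMA-2/LLaMA-3 against BiomedBERT on clinical NER and RE across diverse data sources.
Create a comprehensive, multi-institutional clinical IE corpus covering the main entities and their modifiers.
Assess generalizability, throughput, energy consumption, and memory requirements.
Provide an open-source clinical IE pipeline (Kiwi) that integrates both model families.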

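Proposed Method

Build a clinical IE corpus from 4 datasets (UTP, MTSamples, MIMIC-III, i2b2), annotated with 4 main entities and 16 modifiers.
Instruction-tune LLaMA-2-chat and LLaMA-3-instruct using PEFT (LoRA) with 4-bit quantization, and compare them with BiomedBERT on NER and RE.
Use a unified span-based instruction format for both NER and RE.
Evaluate with exact-match and relaxed-match criteria, and assess cross-institution generalization and resource consumption (GPU hours, memory, energy).
Provide Kiwi, an open-source IE pipeline that combines the LLaMA and BiomedBERT models.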
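
The difference between exact-match and relaxed-match scoring can be illustrated with a minimal sketch. It assumes that a relaxed match means the same label with any character overlap, which is a common convention but is not confirmed by this summary; the `Span` layout, the function names, and the `Problem` and `Drug` labels are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# A span is (start_offset, end_offset, label); the end offset is exclusive.
# This layout is an assumption for illustration, not the paper's format.
Span = Tuple[int, int, str]


def spans_match(gold: Span, pred: Span, relaxed: bool) -> bool:
    """Exact: same offsets and label. Relaxed: same label and any overlap."""
    if gold[2] != pred[2]:
        return False
    if relaxed:
        return gold[0] < pred[1] and pred[0] < gold[1]
    return gold[0] == pred[0] and gold[1] == pred[1]


def prf1(gold: List[Span], pred: List[Span], relaxed: bool = False):
    """Return (precision, recall, F1), matching each gold span at most once."""
    unmatched = list(gold)
    tp = 0
    for p in pred:
        for g in unmatched:
            if spans_match(g, p, relaxed):
                unmatched.remove(g)
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Hypothetical labels and offsets; the second prediction starts one character early.
gold = [(0, 10, "Problem"), (15, 22, "Drug")]
pred = [(0, 10, "Problem"), (14, 22, "Drug")]

exact = prf1(gold, pred, relaxed=False)
relaxed = prf1(gold, pred, relaxed=True)
```

In this toy case the boundary error costs a true positive under exact matching but not under relaxed matching, which is why the two criteria are usually reported side by side.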

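Experimental Results

Research Questions

RQ1: Do instruction-tuned LLaMA models outperform BiomedBERT on clinical NER and RE across multiple data sources?
RQ2: How well do the models generalize to unseen institutions and datasets?
RQ3: What are the computational cost, throughput, and energy consumption of LLMs versus BERT for clinical IE?
RQ4: Is a single pipeline (Kiwi) feasible for deploying both model families in practice?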

Key Findings

Table 2: Exact-match F1 scores for NER and RE on each dataset, including generalization to the unseen i2b2 dataset
Task	Model	UTP	MTSamples	MIMIC-III	i2b2 (unseen)
NER	LLaMA-2-7B	0.929	0.860	0.838	0.846
NER	LLaMA-2-13B	0.932	0.868	0.847	0.853
NER	LLaMA-2-70B	0.931	0.871	0.847	0.860
NER	LLaMA-3-8B	0.929	0.869	0.843	0.852
NER	LLaMA-3-70B	0.932	0.876	0.855	0.872
NER	BiomedBERT	0.921	0.833	0.810	0.798
RE	LLaMA-2-7B	0.916	0.785	0.823	0.823
RE	LLaMA-2-13B	0.915	0.793	0.833	0.833
RE	LLaMA-2-70B	0.918	0.795	0.850	0.850
RE	LLaMA-3-8B	0.936	0.787	0.859	0.859
RE	LLaMA-3-70B	0.937	0.795	0.858	0.858
RE	BiomedBERT	0.898	0.670	0.808	0.808
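
As a worked example of how the gaps quoted for NER follow from Table 2, the sketch below subtracts the BiomedBERT row from the LLaMA-3-70B row. The F1 values are copied from the NER rows of the table as listed here; the variable names are illustrative only.

```python
# Exact-match NER F1 scores copied from Table 2 as listed in this review.
llama3_70b = {"UTP": 0.932, "MTSamples": 0.876, "MIMIC-III": 0.855, "i2b2": 0.872}
biomedbert = {"UTP": 0.921, "MTSamples": 0.833, "MIMIC-III": 0.810, "i2b2": 0.798}

# Absolute F1 gap per dataset, rounded to three decimals to hide float noise.
gaps = {name: round(llama3_70b[name] - biomedbert[name], 3) for name in llama3_70b}

for name, gap in gaps.items():
    print(f"{name}: +{gap * 100:.1f} F1 points")
```

The smallest gap is on UTP and the largest is on the unseen i2b2 data, which lines up with the roughly 1% and 7% NER figures reported in the abstract.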

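LLaMA models consistently outperform BiomedBERT on exact-match NER and RE scores across datasets, and the advantage is most pronounced with limited training data and on unseen data.
With sufficient training data, LLaMA's gains are modest (about 1% on NER and 1.5-3.7% on RE); in low-resource settings they are larger (up to about 4.5% on NER and about 12.5% on RE).
On the unseen i2b2 data, LLaMA-3-70B improves F1 over BiomedBERT by more than 7% on NER and by about 4% on RE.
LLaMA models require substantially more memory, GPU hours, and energy, and their inference is slower than BiomedBERT's (up to 28 times slower in some cases).
Kiwi provides an open-source, Docker-based IE pipeline with both LLaMA-based and BiomedBERT-based options for practical use.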
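
The throughput comparison above can be reproduced in spirit with a small timing harness. This is a minimal sketch, not the paper's benchmarking setup: `notes_per_second`, `slowdown`, and `dummy_extract` are hypothetical names, the stand-in extractor only splits text, and the two throughput values passed to `slowdown` are made-up numbers used solely to show the arithmetic behind a 28x figure.

```python
import time
from typing import Callable, Iterable


def notes_per_second(extract: Callable[[str], object], notes: Iterable[str]) -> float:
    """Run `extract` over every note and return throughput in notes per second."""
    notes = list(notes)
    start = time.perf_counter()
    for note in notes:
        extract(note)
    elapsed = time.perf_counter() - start
    return len(notes) / elapsed if elapsed > 0 else float("inf")


def slowdown(baseline_nps: float, candidate_nps: float) -> float:
    """How many times slower the candidate is than the baseline."""
    return baseline_nps / candidate_nps


# Stand-in extractor (hypothetical); replace with a real model's inference call.
def dummy_extract(note: str) -> list:
    return note.split()


rate = notes_per_second(dummy_extract, ["Patient denies chest pain."] * 100)

# Made-up throughputs, not measurements: 56 vs 2 notes per second is a 28x slowdown.
ratio = slowdown(56.0, 2.0)
```

For a fair comparison, both models would be timed on the same notes and hardware, with any model loading excluded from the timed region.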


This review was generated by AI and reviewed by human editors.