QUICK REVIEW

[Paper Review] CORAL: Expert-Curated medical Oncology Reports to Advance Language Model Inference

Madhumita Sushil, Vanessa E. Kennedy|arXiv (Cornell University)|Aug 7, 2023

Topic Modeling | References: 30 | Cited by: 7

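One-Sentence Summary

The paper introduces a detailed oncology annotation schema and evaluates three zero-shot LLMs (GPT-4, GPT-3.5-turbo, FLAN-UL2) on annotated breast and pancreatic cancer progress notes; GPT-4 achieved the best overall performance.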

ABSTRACT

Both medical care and observational studies in oncology require a thorough understanding of a patient's disease progression and treatment history, often elaborately documented in clinical notes. Despite their vital role, no current oncology information representation and annotation schema fully encapsulates the diversity of information recorded within these notes. Although large language models (LLMs) have recently exhibited impressive performance on various medical natural language processing tasks, due to the current lack of comprehensively annotated oncology datasets, an extensive evaluation of LLMs in extracting and reasoning with the complex rhetoric in oncology notes remains understudied. We developed a detailed schema for annotating textual oncology information, encompassing patient characteristics, tumor characteristics, tests, treatments, and temporality. Using a corpus of 40 de-identified breast and pancreatic cancer progress notes at University of California, San Francisco, we applied this schema to assess the zero-shot abilities of three recent LLMs (GPT-4, GPT-3.5-turbo, and FLAN-UL2) to extract detailed oncological history from two narrative sections of clinical progress notes. Our team annotated 9028 entities, 9986 modifiers, and 5312 relationships. The GPT-4 model exhibited overall best performance, with an average BLEU score of 0.73, an average ROUGE score of 0.72, an exact-match F1-score of 0.51, and an average accuracy of 68% on complex tasks (expert manual evaluation on subset). Notably, it was proficient in tumor characteristic and medication extraction, and demonstrated superior performance in relational inference like adverse event detection. However, further improvements are needed before using it to reliably extract important facts from cancer progress notes needed for clinical research, complex population management, and documenting quality patient care.

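Motivation and Objectives

Motivate the need for a comprehensive representation of oncology information in clinical notes.
Develop and apply a detailed schema for annotating textual oncology information (patient and tumor characteristics, tests, treatments, temporality).
Evaluate the zero-shot ability of leading LLMs to extract information from, and reason over, oncology notes.
Quantify extraction performance on a de-identified dataset using automated metrics and expert evaluation.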

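Proposed Method

Create an annotation schema covering patient characteristics, tumor characteristics, tests, treatments, and temporality.
Assemble a corpus of 40 de-identified breast and pancreatic cancer progress notes from UCSF.
Annotate 9028 entities, 9986 modifiers, and 5312 relationships using the schema.
Apply zero-shot inference with three LLMs (GPT-4, GPT-3.5-turbo, FLAN-UL2) to extract oncological history from two narrative sections of the notes.
Evaluate model output against the expert manual annotations using BLEU, ROUGE, and exact-match F1; measure accuracy on complex tasks through expert manual evaluation of a subset.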
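To make the exact-match F1 metric concrete, here is a minimal sketch of how such a score could be computed for one note. This summary does not describe the paper's actual matching procedure, so the `Entity` fields, the lowercase/whitespace normalization, and the example entities are all illustrative assumptions, not the CORAL schema or the authors' evaluation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """One extracted item. Field names are hypothetical, not the CORAL schema's."""
    category: str  # e.g. "treatment", "tumor_characteristic", "test"
    text: str      # the extracted span or value


def normalize(entity: Entity) -> Entity:
    # Assumption: matching ignores case and extra whitespace.
    category = entity.category.strip().lower()
    text = " ".join(entity.text.lower().split())
    return Entity(category, text)


def exact_match_f1(predicted: list[Entity], gold: list[Entity]) -> float:
    """F1 over the sets of normalized entities; a match must be identical."""
    pred_set = {normalize(e) for e in predicted}
    gold_set = {normalize(e) for e in gold}
    if not pred_set and not gold_set:
        return 1.0  # convention chosen here: nothing to find, nothing predicted
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Made-up example data, not drawn from the corpus.
gold = [
    Entity("treatment", "Paclitaxel"),
    Entity("tumor_characteristic", "ER positive"),
    Entity("test", "CT abdomen"),
]
predicted = [
    Entity("treatment", "paclitaxel"),          # matches after normalization
    Entity("tumor_characteristic", "ER+"),      # same meaning, not an exact match
]

score = exact_match_f1(predicted, gold)
print(f"exact-match F1 = {score:.2f}")
```

In this example "ER+" and "ER positive" mean the same thing but do not count as a match. That strictness is one plausible reason an exact-match F1 can sit well below overlap-based scores such as BLEU and ROUGE on the same output.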

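Experimental Results

Research Questions

RQ1: How well can zero-shot LLMs extract structured oncological history from narrative progress notes using the CORAL schema?
RQ2: Among GPT-4, GPT-3.5-turbo, and FLAN-UL2, which model performs best on entity extraction, relation extraction, and relational inference tasks?
RQ3: What are the strengths and limitations of current LLMs in capturing oncology documentation such as tumor characteristics, medications, and adverse event relationships?

Key Findings

GPT-4 achieved the strongest overall performance among the evaluated models.
GPT-4 average BLEU score: 0.73.
GPT-4 average ROUGE score: 0.72.
GPT-4 exact-match F1-score: 0.51.
GPT-4 average accuracy on complex tasks: 68% (expert manual evaluation on a subset).
GPT-4 was proficient at tumor characteristic and medication extraction, and outperformed the other models at relational inference such as adverse event detection.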

This review was generated by AI and checked by a human editor.