QUICK REVIEW

[Paper Review] Automated Extraction of Unstructured Post-SBRT Toxicity Data from Radiology Reports Using Large Language Models

Justin Pijanowski, Yakout Mezgueldi|arXiv (Cornell University)|Feb 26, 2026

Topic Modeling | Cited by 0

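One-sentence summary

The study shows that a prompt-engineered Llama 3.3-70-B-Instruct can extract post-radiotherapy toxicity and progression outcomes from radiology reports with performance viable for clinical data curation.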

ABSTRACT

We evaluated the viability of using a Large Language Model (LLM) to extract patient-specific toxicity and progression outcomes from unstructured radiology reports. We retrospectively extracted 160 follow-up CT and PET/CT electronic medical record notes for patients treated with lung stereotactic body radiotherapy (SBRT) at our institution from January 2017 through December 2023. Using the Llama 3.3-70-B-Instruct LLM, we engineered prompts to extract four clinical endpoints from each radiology report: locoregional progression, distant progression, radiation-induced fibrosis, and radiation-induced rib fractures. Progression endpoints were classified as yes, no, or maybe, while fibrosis and rib fractures were binary (yes or no). Ground truth labels were defined using two-grader consensus for the 60-note training set, used for prompt development, and a three-grader majority vote for the 100-note test set. LLM performance was evaluated using sensitivity, specificity, and accuracy. As detailed by our evaluation metrics, the strong performance of our methods demonstrates the viability of using prompt-engineered LLMs to extract radiation toxicities and progression classification from radiology reports.

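Motivation and objectives

Advance the automated extraction of structured toxicity and progression data from unstructured post-SBRT radiology notes.
Develop prompts that map radiology language onto four clinical endpoints (locoregional progression, distant progression, fibrosis, rib fractures).
Create ground-truth labels through multi-annotator consensus to develop and test the prompts.
Evaluate LLM performance with standard metrics to assess feasibility for radiotherapy toxicity monitoring.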

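Proposed method

Use 160 follow-up CT and PET/CT notes from SBRT patients (2017–2023).
Apply Llama 3.3-70-B-Instruct with engineered prompts to extract four endpoints from each report.
Endpoints: locoregional progression and distant progression (yes/no/maybe), fibrosis (yes/no), rib fractures (yes/no).
Ground truth was established by two-grader consensus on the 60-note training set and by three-grader majority vote on the 100-note test set.
Performance was evaluated with sensitivity, specificity, and accuracy.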
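
The endpoint definitions and grader-based labeling described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the endpoint keys, the function names `validate_label` and `majority_vote`, and the choice to return `None` when no strict majority exists are all assumptions, since the summary does not say how disagreements were resolved.

```python
from collections import Counter
from typing import Optional, Sequence

# Allowed labels per endpoint, following the summary above.
# The dictionary keys are hypothetical names chosen for this sketch.
ENDPOINT_LABELS = {
    "locoregional_progression": {"yes", "no", "maybe"},
    "distant_progression": {"yes", "no", "maybe"},
    "fibrosis": {"yes", "no"},
    "rib_fracture": {"yes", "no"},
}


def validate_label(endpoint: str, label: str) -> str:
    """Normalize a label and check that it is allowed for the endpoint."""
    normalized = label.strip().lower()
    if normalized not in ENDPOINT_LABELS[endpoint]:
        raise ValueError(f"{label!r} is not a valid label for {endpoint}")
    return normalized


def majority_vote(endpoint: str, grades: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of graders.

    Returns None when there is no strict majority (an assumption of this
    sketch: such notes would need separate adjudication).
    """
    votes = Counter(validate_label(endpoint, g) for g in grades)
    label, count = votes.most_common(1)[0]
    return label if count > len(grades) / 2 else None


if __name__ == "__main__":
    # Three graders, test-set style majority vote.
    print(majority_vote("distant_progression", ["yes", "maybe", "yes"]))
    # Two graders, training-set style consensus: both must agree.
    print(majority_vote("fibrosis", ["yes", "no"]))
```

With two graders, a strict majority is the same as full agreement, so the same function covers both the two-grader consensus and the three-grader vote under these assumptions.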

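Experimental results

Research questions

RQ1: Can a prompt-engineered LLM accurately extract post-radiotherapy toxicity and progression endpoints from unstructured radiology reports?
RQ2: What are the sensitivity, specificity, and accuracy of LLM-based extraction for each endpoint?
RQ3: Is the extraction reliable enough for clinical toxicity monitoring and data curation?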
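
For reference, the three metrics named in RQ2 can be computed per endpoint as in the sketch below. The function name `binary_metrics` is hypothetical, and the handling of the three-class progression endpoints is an assumption: every label other than the chosen positive label (including "maybe") is treated as negative, which may differ from how the paper scored those endpoints.

```python
from typing import Dict, Sequence


def binary_metrics(
    truth: Sequence[str], pred: Sequence[str], positive: str = "yes"
) -> Dict[str, float]:
    """Compute sensitivity, specificity, and accuracy for one endpoint.

    Any label other than `positive` counts as negative (one-vs-rest),
    which is an assumption of this sketch.
    """
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")

    pairs = list(zip(truth, pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(num: int, den: int) -> float:
        # Undefined ratios (no positives or no negatives) become NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "accuracy": ratio(tp + tn, len(pairs)),
    }


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the paper.
    truth = ["yes", "yes", "no", "no", "no"]
    pred = ["yes", "no", "no", "no", "yes"]
    print(binary_metrics(truth, pred))
```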

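Key findings

LLM-based extraction showed strong performance on every endpoint according to the reported evaluation metrics.
Ground-truth annotation used a robust multi-grader consensus (two graders for the training set, three for the test set).
The approach supports the feasibility of using prompt-engineered LLMs to extract radiation toxicity and progression classifications from radiology reports.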

This review was generated by AI and reviewed by human editors.