QUICK REVIEW

[Paper Review] Classifying Cancer Stage with Open-Source Clinical Large Language Models

Chia‐Hsuan Chang, Mary M. Lucas|arXiv (Cornell University)|Apr 2, 2024

Radiomics and Machine Learning in Medical Imaging|Cited by 5

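One-Sentence Summary

This paper shows that, without labeled training data, open-source clinical large language models can extract pathologic TNM (pTNM) stage from unstructured pathology reports and, with suitable prompting, match a fine-tuned baseline on M classification and surpass it on N classification, while still trailing it on T classification.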

ABSTRACT

Cancer stage classification is important for making treatment and care management plans for oncology patients. Information on staging is often included in unstructured form in clinical, pathology, radiology and other free-text reports in the electronic health record system, requiring extensive work to parse and obtain. To facilitate the extraction of this information, previous NLP approaches rely on labeled training datasets, which are labor-intensive to prepare. In this study, we demonstrate that without any labeled training data, open-source clinical large language models (LLMs) can extract pathologic tumor-node-metastasis (pTNM) staging information from real-world pathology reports. Our experiments compare LLMs and a BERT-based model fine-tuned using the labeled data. Our findings suggest that while LLMs still exhibit subpar performance in Tumor (T) classification, with the appropriate adoption of prompting strategies, they can achieve comparable performance on Metastasis (M) classification and improved performance on Node (N) classification.

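Motivation and Objectives

Automate the extraction of cancer TNM stage from unstructured pathology reports.
Assess how well open-source clinical LLMs classify pTNM stage without labeled training data.
Compare prompting strategies for LLMs and benchmark their performance against a fine-tuned model.
Evaluate robustness across the T, N, and M categories and across cancer types in TCGA pathology reports.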

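Proposed Method

Evaluate TNM classification without fine-tuning on TCGA pathology reports (n = 6,940 reports with ground-truth labels).
Compare three open-source LLMs (Llama-2-70b-chat, ClinicalCamel-70B, and Med42-70B) against a fine-tuned Clinical-BigBird baseline.
Apply three prompting strategies: zero-shot, zero-shot chain-of-thought (ZS-COT), and few-shot (FS).
Extract TNM labels from the model output with regular-expression patterns (T: T1–T4, N: N0–N3, M: M0–M1).
Compute 95% confidence intervals by bootstrapping (B = 500).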
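
The prompting and label-extraction steps listed above can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the prompt wording, the "Let's think step by step." trigger, the regular expressions, the choice to keep the last matching label, and the names `build_prompt` and `extract_label` are all assumptions made for this example.

```python
import re

# Label sets taken from the review text above.
LABELS = {
    "T": ["T1", "T2", "T3", "T4"],
    "N": ["N0", "N1", "N2", "N3"],
    "M": ["M0", "M1"],
}

# Hypothetical question template; the paper's actual prompt is not reproduced here.
QUESTION = "What is the pathologic {category} stage? Choose one of: {labels}."

# Assumed patterns: an optional "p" prefix covers forms such as "pT2", and
# sub-stage suffixes such as "T1a" collapse to the main label.
PATTERNS = {
    "T": re.compile(r"(?<![A-Za-z])p?T([1-4])(?!\d)"),
    "N": re.compile(r"(?<![A-Za-z])p?N([0-3])(?!\d)"),
    "M": re.compile(r"(?<![A-Za-z])p?M([01])(?!\d)"),
}


def build_prompt(report, category, strategy="zero_shot", examples=()):
    """Build a prompt for one report; strategy is zero_shot, zs_cot, or few_shot."""
    labels = ", ".join(LABELS[category])
    question = QUESTION.format(category=category, labels=labels)
    parts = []
    if strategy == "few_shot":
        # Each demonstration is a (report_text, gold_label) pair.
        for demo_report, demo_label in examples:
            parts.append(f"Report: {demo_report}\n{question}\nAnswer: {demo_label}")
    parts.append(f"Report: {report}\n{question}\nAnswer:")
    prompt = "\n\n".join(parts)
    if strategy == "zs_cot":
        # Common zero-shot chain-of-thought trigger phrase (an assumption here).
        prompt += " Let's think step by step."
    return prompt


def extract_label(output, category):
    """Return the last label of the given category in the output, or None."""
    matches = PATTERNS[category].findall(output)
    # Assumption: with chain-of-thought output the final mention is the answer.
    return f"{category}{matches[-1]}" if matches else None


if __name__ == "__main__":
    print(build_prompt("Invasive ductal carcinoma, 3.5 cm.", "T", "zs_cot"))
    print(extract_label("The tumor is 3.5 cm, so the stage is pT2.", "T"))
```

Returning `None` when nothing matches keeps unparseable generations visible, so they can be counted as errors rather than silently dropped.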

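Experimental Results

Research Questions

RQ1: Can open-source clinical LLMs extract pTNM stage from real-world pathology reports without labeled training data?
RQ2: How do different prompting strategies affect classification performance on the T, N, and M categories?
RQ3: How do open-source clinical LLMs compare with the fine-tuned Clinical-BigBird baseline on pTNM extraction?
RQ4: Does performance vary by cancer type (e.g., BRCA, LUAD) or by TNM category?
RQ5: What are the strengths and limitations of deploying open-source LLMs for clinical staging on heterogeneous real-world reports?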

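Key Findings

Open-source LLMs can extract pTNM stage from pathology reports without any training data.
With zero-shot prompting, ClinicalCamel-70B and Med42-70B outperform Llama-2-70b-chat and reach macro-F1 comparable to or better than Clinical-BigBird on the N and M categories.
Zero-shot chain-of-thought prompting improves macro-F1 on T, N, and M over plain zero-shot prompting for several models.
Few-shot prompting generally does not improve macro-F1 and can lower it, because the reports vary across institutions.
With ZS-COT or few-shot (FS) prompting, Med42-70B performs strongly on the N and M categories and even surpasses Clinical-BigBird in some per-cancer-type analyses.
M1 (distant metastasis) remains the hardest class for every model, and scores on rare classes are consistently low.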
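
Macro-F1 averages the per-class F1 scores with equal weight, so a rare class such as M1 counts as much as M0. The sketch below illustrates this together with a bootstrap confidence interval using B = 500 resamples. It is an illustration under stated assumptions, not the paper's evaluation code: the percentile method, the convention of scoring an absent class as 0, the toy labels, and the function names are all assumed for this example.

```python
import random


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # Assumed convention: a class with no true or predicted cases scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_ci(y_true, y_pred, labels, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for macro-F1 (percentile method assumed)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample report indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(
            macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx], labels)
        )
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


if __name__ == "__main__":
    # Toy data: 95 M0 reports, 5 M1 reports, and a model that always says M0.
    y_true = ["M0"] * 95 + ["M1"] * 5
    y_pred = ["M0"] * 100
    print("macro-F1:", macro_f1(y_true, y_pred, ["M0", "M1"]))
    print("95% CI:", bootstrap_ci(y_true, y_pred, ["M0", "M1"]))
```

In this toy case the always-M0 model is 95% accurate, yet its macro-F1 is about 0.487, because the F1 for M1 is 0.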

This review was generated by AI and checked by a human editor.