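[Paper Summary] Generalization in Healthcare AI: Evaluation of a Clinical Large Language Model
This study evaluates ClinicLLM, a model trained on clinical notes from four hospitals, on 30-day readmission prediction. It analyzes generalization across hospitals and patient groups and compares fine-tuning strategies for improving that generalization.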
Advances in large language models (LLMs) provide new opportunities in healthcare for improved patient care, clinical decision-making, and enhancement of physician and administrator workflows. However, the potential of these models importantly depends on their ability to generalize effectively across clinical environments and populations, a challenge often underestimated in early development. To better understand reasons for these challenges and inform mitigation approaches, we evaluated ClinicLLM, an LLM trained on [HOSPITAL]'s clinical notes, analyzing its performance on 30-day all-cause readmission prediction focusing on variability across hospitals and patient characteristics. We found poorer generalization particularly in hospitals with fewer samples, among patients with government and unspecified insurance, the elderly, and those with high comorbidities. To understand reasons for lack of generalization, we investigated sample sizes for fine-tuning, note content (number of words per note), patient characteristics (comorbidity level, age, insurance type, borough), and health system aspects (hospital, all-cause 30-day readmission, and mortality rates). We used descriptive statistics and supervised classification to identify features. We found that, along with sample size, patient age, number of comorbidities, and the number of words in notes are all important factors related to generalization. Finally, we compared local fine-tuning (hospital specific), instance-based augmented fine-tuning and cluster-based fine-tuning for improving generalization. Among these, local fine-tuning proved most effective, increasing AUC by 0.25% to 11.74% (most helpful in settings with limited data). Overall, this study provides new insights for enhancing the deployment of large language models in the societally important domain of healthcare, and improving their performance for broader populations.
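Research Motivation and Objectives
- Evaluate how well ClinicLLM generalizes within a health system, across hospitals and across patient groups (insurance, race, age, comorbidity).
- Identify the factors that drive differences in generalization (sample size, clinical note content, patient characteristics, hospital characteristics).
- Assess strategies for improving generalization: hospital-specific local fine-tuning, instance-based augmentation, and cluster-based fine-tuning.
- Provide actionable insights for deploying clinical large language models in diverse healthcare settings.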
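Proposed Method
- Pre-train ClinicLLM on clinical notes from four hospitals, using a 109M-parameter BERT-base architecture with a masked language modeling (MLM) objective.
- Fine-tune ClinicLLM on History and Physical notes with binary readmission labels, using an 80-10-10 train-validation-test split plus a temporal test set.
- Evaluate global fine-tuning (all notes) alongside hospital-specific local fine-tuning, instance-based augmented fine-tuning (samples matched by embedding similarity), and cluster-based fine-tuning (UMAP dimensionality reduction followed by K-means clustering).
- Use AUC (area under the ROC curve), AUPR (area under the precision-recall curve), and ECE (expected calibration error) as the primary metrics; assess generalization across hospitals, insurance types, race, age, and comorbidity levels.
- Identify the key features that affect generalization through descriptive statistics, perplexity analysis, and decision-tree-based clustering.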
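The instance-based augmentation step above matches samples by embedding similarity. The summary does not specify the similarity measure or how many matches are kept, so the following is only a minimal sketch under assumptions: cosine similarity, a hypothetical `k` nearest neighbours per local sample, and toy 2-D vectors standing in for note embeddings. The function names are illustrative and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def augment_with_similar(local_embs, pool_embs, k=1):
    """Return sorted indices of pool samples that are among the k most similar
    pool samples to at least one local sample.

    Assumption: the selected pool samples would be added to the local
    hospital's fine-tuning set. Ties are broken by lower pool index.
    """
    selected = set()
    for local in local_embs:
        ranked = sorted(
            range(len(pool_embs)),
            key=lambda j: cosine_similarity(local, pool_embs[j]),
            reverse=True,
        )
        selected.update(ranked[:k])
    return sorted(selected)


# Hypothetical toy embeddings: two notes from a small hospital and a pool of
# four notes from other hospitals.
local_embs = [[1.0, 0.0], [0.9, 0.1]]
pool_embs = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0], [0.8, 0.3]]

matched = augment_with_similar(local_embs, pool_embs, k=2)
print(matched)
```

With real note embeddings, the brute-force ranking here would likely be replaced by an approximate nearest-neighbour index, since it compares every local sample against the whole pool.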
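Experimental Results
Research Questions
- RQ1: How well does ClinicLLM generalize across hospitals and patient subgroups on 30-day all-cause readmission prediction?
- RQ2: Which factors are most associated with poor generalization (sample size, note length, age, comorbidities, insurance, race)?
- RQ3: Do the fine-tuning strategies (local, instance-based augmented, cluster-based) improve generalization, and by how much?
- RQ4: For data-limited hospitals, which strategy yields the largest relative AUC gain?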
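RQ1 comes down to computing the same metric separately for each hospital or patient subgroup. As a rough illustration (not the authors' evaluation code), the sketch below computes AUC per group in plain Python, using the definition of AUC as the probability that a randomly chosen positive case is scored above a randomly chosen negative one. The group names, labels, and scores are made-up toy values.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC via pairwise comparison of positives and negatives (ties count as 0.5).

    Returns None when a group lacks either class, because AUC is undefined there.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_by_group(records):
    """records: iterable of (group, label, score) tuples; returns {group: AUC}."""
    grouped = defaultdict(lambda: ([], []))
    for group, label, score in records:
        grouped[group][0].append(label)
        grouped[group][1].append(score)
    return {g: auc(ys, ss) for g, (ys, ss) in grouped.items()}


# Hypothetical toy predictions for two hospitals.
records = [
    ("Hospital A", 1, 0.9), ("Hospital A", 0, 0.2),
    ("Hospital A", 1, 0.7), ("Hospital A", 0, 0.4),
    ("Hospital B", 1, 0.3), ("Hospital B", 0, 0.6),
    ("Hospital B", 1, 0.8), ("Hospital B", 0, 0.5),
]

per_group = auc_by_group(records)
print(per_group)
```

The pairwise loop is quadratic in group size, which is fine for a sketch; on cohorts of the size reported here, a rank-based implementation such as scikit-learn's `roc_auc_score` would be the more practical choice.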
Key Findings
| Group | Item | AUC (%) | AUPR (%) | ECE | Readmission Rate (%) | Sample Size |
|---|---|---|---|---|---|---|
| Hospital | Hospital 1 | 74.60 | 34.10 | 0.21 | 14.80 | 102,275 |
| Hospital | Hospital 2 | 73.04 | 29.69 | 0.22 | 13.70 | 51,545 |
| Hospital | Hospital 3 | 69.90 | 20.70 | 0.27 | 9.70 | 4,502 |
| Hospital | Hospital 4 | 51.20 | 14.40 | 0.42 | 14.40 | 3,451 |
| Insurance Type | Government | 65.15 | 32.72 | 0.22 | 20.30 | 54,705 |
| Insurance Type | Private | 76.43 | 30.01 | 0.22 | 11.20 | 105,328 |
| Insurance Type | Self-Pay | 77.78 | 13.02 | 0.38 | 6.30 | 1,257 |
| Insurance Type | Other | 64.03 | 16.71 | 0.35 | 16.80 | 483 |
| Race Group | White | 72.68 | 30.06 | 0.22 | 14.40 | 89,273 |
| Race Group | Black | 71.71 | 33.10 | 0.21 | 15.80 | 19,207 |
| Race Group | Asian | 76.56 | 33.84 | 0.19 | 23.00 | 16,592 |
| Race Group | American Indian or Alaska Native | 81.27 | 34.03 | 0.24 | 7.20 | 1,068 |
| Race Group | Native Hawaiian or Other Pacific Islander | 57.82 | 8.96 | 0.42 | 9.20 | 704 |
| Race Group | Unknown | 75.10 | 31.23 | 0.22 | 14.00 | 34,929 |
| Age Group | Under 18 | 75.21 | 26.53 | 0.21 | 4.50 | 24,147 |
| Age Group | Young Adult (18-35) | 80.81 | 23.88 | 0.26 | 8.50 | 16,707 |
| Age Group | Adult (35-60) | 74.69 | 31.07 | 0.21 | 11.30 | 40,937 |
| Age Group | Above 60 | 64.75 | 32.06 | 0.22 | 20.00 | 79,858 |
| Comorbidities | Level 1 (Low) | 74.76 | 24.46 | 0.25 | 9.40 | 110,258 |
| Comorbidities | Level 2 (Moderate) | 66.86 | 33.69 | 0.22 | 20.30 | 21,830 |
| Comorbidities | Level 3 (High) | 61.43 | 37.93 | 0.20 | 27.10 | 25,160 |
| Comorbidities | Level 4 (Severe) | 58.08 | 43.25 | 0.19 | 33.00 | 4,525 |
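- Generalization is uneven across hospitals: on the temporal test set, Hospital 3 and Hospital 4 reach AUCs of 69.90% and 51.20%, clearly below Hospital 1 (74.60%).
- Performance differs across insurance and race groups. The Government (65.15%) and Other/unspecified (64.03%) insurance groups have lower AUC than Self-Pay (77.78%). The Asian (76.56%) and American Indian or Alaska Native (81.27%) groups have relatively high AUC, while the Native Hawaiian or Other Pacific Islander group (57.82%) performs worse.
- Age has a marked effect on generalization: the Above 60 group has an AUC of 64.75%, the lowest of any age group.
- AUC falls as comorbidity level rises (61.43% and 58.08% for CCI Levels 3 and 4), while AUPR rises with comorbidity level. The AUPR trend is consistent with the rising readmission rate (9.40% at Level 1 to 33.00% at Level 4), since AUPR depends on class prevalence.
- Local hospital-specific fine-tuning yields the largest relative AUC gains (up to 11.74%, for Hospital 4) and better calibration across hospitals.
- Compared with local fine-tuning, instance-based augmented fine-tuning and cluster-based fine-tuning yield smaller or less consistent improvements.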
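The ECE column in the table and the calibration finding above both rest on expected calibration error. The summary does not state the binning scheme used, so the following is only a sketch of one common binary variant: predictions are grouped into equal-width probability bins, and ECE is the sample-weighted average gap between the observed positive rate and the mean predicted probability in each bin. The choice of 10 bins and the toy inputs are assumptions.

```python
def expected_calibration_error(labels, probs, n_bins=10):
    """Binary ECE over equal-width bins of the predicted positive-class probability.

    For each non-empty bin, take |observed positive rate - mean predicted
    probability| and weight it by the fraction of samples in that bin.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((y, p))

    total = len(labels)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        observed = sum(y for y, _ in bucket) / len(bucket)
        predicted = sum(p for _, p in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(observed - predicted)
    return ece


# Hypothetical toy cases: a fairly well calibrated model and an overconfident one.
well_calibrated = expected_calibration_error([0, 0, 1, 1], [0.1, 0.1, 0.9, 0.9])
overconfident = expected_calibration_error([0, 1, 0, 1], [0.9, 0.9, 0.9, 0.9])
print(well_calibrated, overconfident)
```

Lower values mean predicted probabilities track observed readmission rates more closely. In the table, the small groups (for example Hospital 4 and Self-Pay) tend to show higher ECE.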
This summary was generated by AI and reviewed by human editors.