QUICK REVIEW

[Paper Review] Generalization in Healthcare AI: Evaluation of a Clinical Large Language Model

Salman Rahman, Lavender Yao Jiang|arXiv (Cornell University)|Feb 14, 2024

Machine Learning in Healthcare | Citations: 8

One-Line Summary

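This study evaluates 30-day readmission prediction with ClinicLLM, a model trained on clinical notes from four hospitals. It analyzes how well the model generalizes across hospitals and patient groups, and compares fine-tuning strategies for improving that generalization.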

ABSTRACT

Advances in large language models (LLMs) provide new opportunities in healthcare for improved patient care, clinical decision-making, and enhancement of physician and administrator workflows. However, the potential of these models importantly depends on their ability to generalize effectively across clinical environments and populations, a challenge often underestimated in early development. To better understand reasons for these challenges and inform mitigation approaches, we evaluated ClinicLLM, an LLM trained on [HOSPITAL]'s clinical notes, analyzing its performance on 30-day all-cause readmission prediction focusing on variability across hospitals and patient characteristics. We found poorer generalization particularly in hospitals with fewer samples, among patients with government and unspecified insurance, the elderly, and those with high comorbidities. To understand reasons for lack of generalization, we investigated sample sizes for fine-tuning, note content (number of words per note), patient characteristics (comorbidity level, age, insurance type, borough), and health system aspects (hospital, all-cause 30-day readmission, and mortality rates). We used descriptive statistics and supervised classification to identify features. We found that, along with sample size, patient age, number of comorbidities, and the number of words in notes are all important factors related to generalization. Finally, we compared local fine-tuning (hospital specific), instance-based augmented fine-tuning and cluster-based fine-tuning for improving generalization. Among these, local fine-tuning proved most effective, increasing AUC by 0.25% to 11.74% (most helpful in settings with limited data). Overall, this study provides new insights for enhancing the deployment of large language models in the societally important domain of healthcare, and improving their performance for broader populations.

Motivation and Objectives

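Assess how well ClinicLLM generalizes across hospitals within the health system and across patient groups (insurance, race, age, comorbidities).
Identify the factors behind generalization gaps (sample size, note content, patient characteristics, hospital characteristics).
Evaluate strategies for improving generalization: local hospital-specific fine-tuning, instance-based augmentation, and cluster-based fine-tuning.
Provide practical insights for deploying clinical LLMs across diverse healthcare settings.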

Proposed Method

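Pre-train ClinicLLM on clinical notes from four hospitals, using a 109M-parameter BERT-base architecture and a masked language modeling (MLM) objective.
Fine-tune on History and Physical notes with binary readmission labels, using an 80-10-10 train-val-test split and a temporal test set.
Evaluate global fine-tuning on all notes, local hospital-specific fine-tuning, instance-based augmented fine-tuning with samples matched by embedding similarity, and cluster-based fine-tuning using UMAP dimensionality reduction followed by K-means clustering.
Use AUC, AUPR, and ECE as the primary metrics; evaluate generalization by hospital, insurance type, race, age, and comorbidity level.
Identify the key features affecting generalization through descriptive statistics, perplexity analysis, and decision-tree-based classification.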
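
As a rough illustration of two of the metrics named above, the following sketch computes AUC and ECE for a binary readmission classifier in plain Python. It is not the paper's evaluation code: the function names, the equal-width binning, the default of 10 bins, and the toy labels and scores are all assumptions made for this example, since the review does not state how ECE was binned.

```python
from typing import List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def expected_calibration_error(
    labels: Sequence[int], probs: Sequence[float], n_bins: int = 10
) -> float:
    """ECE with equal-width bins over the predicted positive-class probability.

    Each non-empty bin contributes |observed positive rate - mean predicted
    probability|, weighted by the fraction of samples that fall in the bin.
    """
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((y, p))
    total = len(labels)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        observed = sum(y for y, _ in b) / len(b)
        predicted = sum(p for _, p in b) / len(b)
        ece += len(b) / total * abs(observed - predicted)
    return ece


# Toy data (hypothetical): 1 = readmitted within 30 days, 0 = not readmitted.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print("AUC:", roc_auc(toy_labels, toy_scores))
print("ECE:", round(expected_calibration_error(toy_labels, toy_scores), 4))
```

The pairwise AUC loop is quadratic in the number of samples, which is fine for a demonstration but too slow for test sets of the size reported here; a rank-based implementation or a library routine would be the practical choice.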

Experimental Results

Research Questions

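RQ1: How well does ClinicLLM generalize across hospitals and patient subgroups when predicting 30-day all-cause readmission?
RQ2: Which factors are most associated with poor generalization (sample size, note length, age, comorbidities, insurance, race)?
RQ3: Do local, instance-based augmented, and cluster-based fine-tuning strategies improve generalization, and by how much?
RQ4: For hospitals with limited data, which strategy yields the largest relative AUC improvement?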

Key Findings

Group	Item	AUC (%)	AUPR (%)	ECE	Readmission Rate (%)	Sample Size
Hospital	Hospital 1	74.60	34.10	0.21	14.80	102,275
Hospital	Hospital 2	73.04	29.69	0.22	13.70	51,545
Hospital	Hospital 3	69.90	20.70	0.27	9.70	4,502
Hospital	Hospital 4	51.20	14.40	0.42	14.40	3,451
Insurance Type	Government	65.15	32.72	0.22	20.30	54,705
Insurance Type	Private	76.43	30.01	0.22	11.20	105,328
Insurance Type	Self-Pay	77.78	13.02	0.38	6.30	1,257
Insurance Type	Other	64.03	16.71	0.35	16.80	483
Race Group	White	72.68	30.06	0.22	14.40	89,273
Race Group	Black	71.71	33.10	0.21	15.80	19,207
Race Group	Asian	76.56	33.84	0.19	23.00	16,592
Race Group	American Indian or Alaska Native	81.27	34.03	0.24	7.20	1,068
Race Group	Native Hawaiian or Other Pacific Islander	57.82	8.96	0.42	9.20	704
Race Group	Unknown	75.10	31.23	0.22	14.00	34,929
Age Group	Under 18	75.21	26.53	0.21	4.50	24,147
Age Group	Young Adult (18-35)	80.81	23.88	0.26	8.50	16,707
Age Group	Adult (35-60)	74.69	31.07	0.21	11.30	40,937
Age Group	Above 60	64.75	32.06	0.22	20.00	79,858
Comorbidities	Level 1 (Low)	74.76	24.46	0.25	9.40	110,258
Comorbidities	Level 2 (Moderate)	66.86	33.69	0.22	20.30	21,830
Comorbidities	Level 3 (High)	61.43	37.93	0.20	27.10	25,160
Comorbidities	Level 4 (Severe)	58.08	43.25	0.19	33.00	4,525

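Hospital-level generalization is uneven: on the temporal test set, Hospital 3 and Hospital 4 reach AUCs of 69.90% and 51.20% respectively, markedly lower than Hospital 1 (74.60%).
Performance varies by insurance type and race group. Government and Other/unspecified insurance show lower AUC (65.15% and 64.03%) than Self-Pay (77.78%). Asian and American Indian or Alaska Native patients show relatively high AUC, while Native Hawaiian or Other Pacific Islander patients show low AUC (57.82%).
Age strongly affects generalization: the Above 60 group reaches an AUC of 64.75%, the lowest of any age group.
AUC falls as comorbidity level rises (CCI Levels 3 and 4: 61.43% and 58.08%), although AUPR increases with comorbidity level.
Local hospital-specific fine-tuning yields the largest relative AUC gains (up to 11.74%, at Hospital 4) and also improves calibration across hospitals.
Instance-based augmented fine-tuning and cluster-based fine-tuning give smaller or more variable improvements than local fine-tuning.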
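
The opposite trends of AUC and AUPR across comorbidity levels are easier to read once AUPR is compared with its chance level. A classifier that ranks patients at random has an expected AUPR roughly equal to the positive-class prevalence, so the sketch below divides each AUPR by the readmission rate in the same table row. This "lift" is not a metric reported in the review; it is an illustrative normalization, and it assumes the readmission rate column describes the same sample on which AUPR was measured.

```python
from typing import Dict, Tuple

# (AUPR %, readmission rate %) per comorbidity level, copied from the results table.
comorbidity_rows: Dict[str, Tuple[float, float]] = {
    "Level 1 (Low)": (24.46, 9.40),
    "Level 2 (Moderate)": (33.69, 20.30),
    "Level 3 (High)": (37.93, 27.10),
    "Level 4 (Severe)": (43.25, 33.00),
}


def aupr_lift(aupr_pct: float, prevalence_pct: float) -> float:
    """Ratio of AUPR to the approximate AUPR of a random ranking (the prevalence)."""
    return aupr_pct / prevalence_pct


# Hypothetical helper output: lift per comorbidity level, rounded for display.
lifts = {
    level: round(aupr_lift(aupr, rate), 2)
    for level, (aupr, rate) in comorbidity_rows.items()
}

for level, lift in lifts.items():
    print(f"{level}: AUPR is {lift}x the chance level")
```

Under these assumptions the lift shrinks from about 2.6 at Level 1 to about 1.3 at Level 4, which suggests that the rising AUPR largely tracks the rising readmission rate and is consistent with the falling AUC.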


This review was generated by AI and checked by a human editor.