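[Paper Summary] The Capability of Large Language Models to Measure Psychiatric Functioning
Med-PaLM 2, an LLM tuned on medical knowledge, can estimate depression and PTSD scores from clinical interviews. Its depression estimates are comparable to those of human raters, its PTSD estimates show high specificity, and it works through prompting alone, without task-specific training.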
The current work investigates whether large language models (LLMs) explicitly trained on large corpora of medical knowledge (Med-PaLM 2) can predict psychiatric functioning from patient interviews and clinical descriptions without being trained to do so. To assess this, n = 145 depression assessments, n = 115 PTSD assessments, and n = 46 clinical case studies across high-prevalence, high-comorbidity disorders (depressive, anxiety, psychotic, trauma- and stress-related, and addictive disorders) were analyzed using prompts to extract estimated clinical scores and diagnoses. Results demonstrate that Med-PaLM 2 can assess psychiatric functioning across a range of psychiatric conditions, with the strongest performance in predicting depression scores from standardized assessments (accuracy range = 0.80 to 0.84), which were statistically indistinguishable from those of human clinical raters, t(1,144) = 1.20, p = 0.23. The results show the potential for general clinical language models to flexibly predict psychiatric risk from free descriptions of functioning by both patients and clinicians.
Research Motivation and Objectives
- Determine whether an LLM trained on a large body of medical knowledge (Med-PaLM 2) can predict psychiatric symptom severity and diagnoses from interviews without task-specific training.
- Evaluate the model's ability to estimate PHQ-8 (depression) and PCL-C (PTSD) scores and determine caseness.
- Assess the model's capacity to label DSM-5 diagnostic categories from case studies and describe its reasoning.
- Examine the explanations generated by the model to determine if they are diagnostically informative and clinically plausible.
Proposed Method
- Use Med-PaLM 2 (L model) with prompts tailored to focus on PHQ-8 and PCL-C knowledge and to extract scores and confidence estimates.
- Prompt structure directs attention to relevant rating scales, requests score estimates, confidence, and descriptive reasoning.
- Compare model estimates to human raters using standard metrics (accuracy, sensitivity, specificity, MAE, RMSE, Cohen’s kappa, Pearson r).
- Analyze model-generated textual explanations and frequency of DSM-5–mapped terms to assess diagnostically descriptive capability.
- Utilize a case-study set from DSM-5 Clinical Cases to test broad diagnostic labeling without task-specific training.
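The comparison metrics listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' evaluation code: the function names, the toy scores, and the PHQ-8 caseness cutoff of 10 (a commonly used threshold, passed in as a parameter) are all assumptions made for the example.

```python
from math import sqrt


def binary_metrics(true_scores, pred_scores, cutoff=10):
    """Accuracy, sensitivity, specificity, and Cohen's kappa for caseness.

    `cutoff` is an assumed threshold: a score >= cutoff counts as a case.
    Assumes both cases and non-cases are present in `true_scores`.
    """
    y_true = [s >= cutoff for s in true_scores]
    y_pred = [s >= cutoff for s in pred_scores]
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return {
        "accuracy": observed,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "kappa": (observed - expected) / (1 - expected),
    }


def error_metrics(true_scores, pred_scores):
    """MAE, RMSE, and Pearson r between rater scores and model estimates."""
    n = len(true_scores)
    diffs = [p - t for t, p in zip(true_scores, pred_scores)]
    mean_t = sum(true_scores) / n
    mean_p = sum(pred_scores) / n
    cov = sum((t - mean_t) * (p - mean_p)
              for t, p in zip(true_scores, pred_scores))
    var_t = sum((t - mean_t) ** 2 for t in true_scores)
    var_p = sum((p - mean_p) ** 2 for p in pred_scores)
    return {
        "mae": sum(abs(d) for d in diffs) / n,
        "rmse": sqrt(sum(d * d for d in diffs) / n),
        "pearson_r": cov / sqrt(var_t * var_p),
    }


# Toy data for illustration only; these are not values from the paper.
human = [12, 4, 15, 7, 20, 3]
model = [10, 5, 9, 8, 18, 2]
print(binary_metrics(human, model))
print(error_metrics(human, model))
```

Separating the thresholded metrics from the continuous error metrics mirrors the two kinds of claims the summary makes: agreement on caseness (accuracy, sensitivity, specificity, kappa) and closeness of the raw score estimates (MAE, RMSE, Pearson r).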
Experimental Results
Research Questions
- RQ1 Can Med-PaLM 2 predict PHQ-8 and PCL-C scores from clinical interviews without task-specific training?
- RQ2 How does the model's performance compare to human raters on depression and PTSD assessments in terms of accuracy, error, and diagnostic accuracy?
- RQ3 Is Med-PaLM 2 able to label DSM-5 diagnostic categories from case studies with high accuracy?
- RQ4 Do the model's explanations contain content that aligns with diagnostic reasoning for MDD and PTSD?
- RQ5 What are the limitations in identifying comorbidities or diagnostic modifiers using this approach?
Key Findings
- For PHQ-8 (depression), accuracy is 0.80 and the model’s estimates are not statistically different from human raters (p = 0.23).
- For PCL-C (PTSD), accuracy is 0.74 and the model shows high specificity (0.98) but lower sensitivity (0.30).
- Model-human comparison shows Cohen’s kappa of 0.55 for PHQ-8 and 0.33 for PCL-C, indicating moderate agreement for depression and fair agreement for PTSD.
- Depression predictions achieved an MAE of 2.33 and RMSE of 3.93; PTSD predictions had MAE 9.07 and RMSE 11.2.
- The model correctly labeled diagnostic categories 92.5% of the time and the specific diagnosis 77.5% of the time in case studies.
- DSM-5-relevant terminology was more likely to appear in the model's explanations of its PHQ-8 and PCL-C results, suggesting that its summaries are explainable.
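To see how an accuracy of 0.74 can coexist with a sensitivity of only 0.30 for PCL-C, it helps to recall that accuracy is a prevalence-weighted average of sensitivity and specificity. The sketch below inverts that identity. It assumes the three reported figures come from the same confusion matrix and ignores rounding, so the output is a back-of-the-envelope illustration, not a number reported in the paper; `implied_prevalence` is a hypothetical helper name.

```python
def implied_prevalence(accuracy, sensitivity, specificity):
    """Solve accuracy = sens * p + spec * (1 - p) for the case prevalence p."""
    if specificity == sensitivity:
        raise ValueError("prevalence is not identifiable when sens == spec")
    return (specificity - accuracy) / (specificity - sensitivity)


# PCL-C figures as quoted in the findings above.
p = implied_prevalence(accuracy=0.74, sensitivity=0.30, specificity=0.98)
print(f"implied share of PTSD cases: {p:.2f}")  # roughly 0.35
```

Under these assumptions, about a third of the PTSD assessments would be cases, so the many correctly identified non-cases carry most of the accuracy while the majority of true cases are missed.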
This summary was generated by AI and reviewed by a human editor.