QUICK REVIEW

[Paper Review] The Capability of Large Language Models to Measure Psychiatric Functioning

Isaac R. Galatzer‐Levy, Daniel McDuff|arXiv (Cornell University)|Aug 3, 2023
Mental Health via Writing | Cited by 24
One-Sentence Summary

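Med-PaLM 2, an LLM tuned on medical knowledge, can estimate depression and PTSD scores from clinical interviews using prompts alone, with no task-specific training. Its depression estimates are comparable to those of human raters, and its PTSD estimates show high specificity.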

ABSTRACT

The current work investigates the capability of large language models (LLMs) that are explicitly trained on large corpora of medical knowledge (Med-PaLM 2) to predict psychiatric functioning from patient interviews and clinical descriptions without being trained to do so. To assess this, n = 145 depression and n = 115 PTSD assessments and n = 46 clinical case studies across high prevalence/high comorbidity disorders (Depressive, Anxiety, Psychotic, trauma and stress, Addictive disorders) were analyzed using prompts to extract estimated clinical scores and diagnoses. Results demonstrate that Med-PaLM 2 is capable of assessing psychiatric functioning across a range of psychiatric conditions, with the strongest performance being the prediction of depression scores based on standardized assessments (accuracy range = 0.80–0.84), which were statistically indistinguishable from human clinical raters, t(1,144) = 1.20; p = 0.23. Results show the potential for general clinical language models to flexibly predict psychiatric risk based on free descriptions of functioning from both patients and clinicians.

Research Motivation and Objectives

  • Demonstrate whether a large medical knowledge–trained LLM (Med-PaLM 2) can predict psychiatric symptom severity and diagnoses from interviews without task-specific training.
  • Evaluate the model's ability to estimate PHQ-8 (depression) and PCL-C (PTSD) scores and determine caseness.
  • Assess the model's capacity to label DSM-5 diagnostic categories from case studies and describe its reasoning.
  • Examine the explanations generated by the model to determine if they are diagnostically informative and clinically plausible.
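
"Caseness" in the second objective means turning a continuous scale score into a binary screen-positive label. A minimal sketch of that step follows. The cutoffs are assumptions for illustration: 10 is a widely used PHQ-8 screening threshold, while PCL-C cutoffs vary by population and the value 44 is only a placeholder. This summary does not state which thresholds the paper applied, and the names `ASSUMED_CUTOFFS` and `is_case` are hypothetical.

```python
# Illustrative cutoffs only; the summary does not say which thresholds the paper used.
ASSUMED_CUTOFFS = {
    "PHQ-8": 10,  # widely used screening threshold (scale range 0-24)
    "PCL-C": 44,  # assumption: published cutoffs vary by population (scale range 17-85)
}


def is_case(scale: str, score: float) -> bool:
    """Return True when the score meets or exceeds the assumed cutoff for the scale."""
    return score >= ASSUMED_CUTOFFS[scale]


# Usage: a PHQ-8 total of 12 screens positive, a PCL-C total of 30 does not.
phq_positive = is_case("PHQ-8", 12)
pcl_positive = is_case("PCL-C", 30)
```

The same function can be applied to both the human-rated and the model-estimated score, which yields the paired binary labels that accuracy, sensitivity, and specificity are computed from.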

Proposed Method

  • Use Med-PaLM 2 (L model) with prompts tailored to focus on PHQ-8 and PCL-C knowledge and to extract scores and confidence estimates.
  • Prompt structure directs attention to relevant rating scales, requests score estimates, confidence, and descriptive reasoning.
  • Compare model estimates to human raters using standard metrics (accuracy, sensitivity, specificity, MAE, RMSE, Cohen’s kappa, Pearson r).
  • Analyze model-generated textual explanations and frequency of DSM-5–mapped terms to assess diagnostically descriptive capability.
  • Utilize a case-study set from DSM-5 Clinical Cases to test broad diagnostic labeling without task-specific training.
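
The comparison metrics named above can be sketched in plain Python. This is not the authors' evaluation code; it is a minimal illustration using the standard definitions, and the function names `binary_metrics` and `score_metrics` are hypothetical. It assumes both classes occur in the reference labels and that neither score list is constant, since otherwise some denominators are zero.

```python
import math


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and Cohen's kappa for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)  # share of true cases that were detected
    specificity = tn / (tn + fp)  # share of non-cases correctly ruled out

    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((tn + fp) / n) * ((tn + fn) / n)
    p_chance = p_yes + p_no
    kappa = (accuracy - p_chance) / (1 - p_chance)

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


def score_metrics(true_scores, pred_scores):
    """MAE, RMSE, and Pearson r between reference and estimated scale scores."""
    n = len(true_scores)
    errors = [p - t for t, p in zip(true_scores, pred_scores)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    mean_t = sum(true_scores) / n
    mean_p = sum(pred_scores) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(true_scores, pred_scores))
    var_t = sum((t - mean_t) ** 2 for t in true_scores)
    var_p = sum((p - mean_p) ** 2 for p in pred_scores)
    pearson_r = cov / math.sqrt(var_t * var_p)

    return {"mae": mae, "rmse": rmse, "pearson_r": pearson_r}


# Toy data for illustration only; these are not values from the paper.
toy_binary = binary_metrics([1, 1, 0, 0], [1, 0, 0, 0])
toy_scores = score_metrics([0, 2, 4], [1, 2, 3])
```

RMSE squares each error before averaging, so it is never smaller than MAE, and a large gap between the two points to a few large misses rather than uniformly moderate error.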

Experimental Results

Research Questions

  • RQ1 Can Med-PaLM 2 predict PHQ-8 and PCL-C scores from clinical interviews without task-specific training?
  • RQ2 How does the model's performance compare to human raters on depression and PTSD assessments in terms of accuracy, error, and diagnostic accuracy?
  • RQ3 Is Med-PaLM 2 able to label DSM-5 diagnostic categories from case studies with high accuracy?
  • RQ4 Do the model's explanations contain content that aligns with diagnostic reasoning for MDD and PTSD?
  • RQ5 What are the limitations in identifying comorbidities or diagnostic modifiers using this approach?

Key Findings

  • For PHQ-8 (depression), accuracy is 0.80 and the model’s estimates are not statistically different from human raters (p = 0.23).
  • For PCL-C (PTSD), accuracy is 0.74 and the model shows high specificity (0.98) but lower sensitivity (0.30).
  • Model-human comparison shows Cohen’s kappa of 0.55 for PHQ-8 and 0.33 for PCL-C, indicating moderate agreement for depression and fair agreement for PTSD.
  • Depression predictions achieved an MAE of 2.33 and RMSE of 3.93; PTSD predictions had MAE 9.07 and RMSE 11.2.
  • The model correctly labeled diagnostic categories 92.5% of the time and the specific diagnosis 77.5% of the time in case studies.
  • Model explanations and DSM-5–relevant terminology were more likely to appear when describing PHQ-8 and PCL-C results, indicating explainable summaries.
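
The PTSD figures above can look contradictory: sensitivity is only 0.30, yet accuracy is 0.74. The identity accuracy = sensitivity × prevalence + specificity × (1 − prevalence) explains this. The sketch below rearranges it to back out the share of screen-positive cases implied by the three reported numbers. This is a back-of-envelope derivation from rounded values, not a figure reported in this summary, and `implied_prevalence` is a hypothetical helper name.

```python
def implied_prevalence(accuracy: float, sensitivity: float, specificity: float) -> float:
    """Solve accuracy = sens * prev + spec * (1 - prev) for prev."""
    return (specificity - accuracy) / (specificity - sensitivity)


# Reported PCL-C values from the findings above (rounded, so the result is approximate).
ptsd_prev = implied_prevalence(accuracy=0.74, sensitivity=0.30, specificity=0.98)
# ptsd_prev is roughly 0.35
```

With about a third of the sample screening positive under this estimate, non-cases form the majority, so a model that rarely flags PTSD can be accurate overall while still missing most true cases.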

This review was generated by AI and reviewed by human editors.