QUICK REVIEW

[Paper Review] The Capability of Large Language Models to Measure Psychiatric Functioning

Isaac R. Galatzer‐Levy, Daniel McDuff|arXiv (Cornell University)|Aug 3, 2023

Mental Health via Writing | Cited by 24

One-Sentence Summary

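Med-PaLM 2, an LLM tuned on medical knowledge, can estimate depression and PTSD scores from clinical interviews. Its depression estimates are comparable to those of human raters, its PTSD estimates show high specificity, and it works from prompts alone, without task-specific training.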

ABSTRACT

The current work investigates the capability of large language models (LLMs) that are explicitly trained on large corpora of medical knowledge (Med-PaLM 2) to predict psychiatric functioning from patient interviews and clinical descriptions without being trained to do so. To assess this, n = 145 depression and n = 115 PTSD assessments and n = 46 clinical case studies across high-prevalence/high-comorbidity disorders (Depressive, Anxiety, Psychotic, Trauma and Stress, Addictive disorders) were analyzed using prompts to extract estimated clinical scores and diagnoses. Results demonstrate that Med-PaLM 2 is capable of assessing psychiatric functioning across a range of psychiatric conditions, with the strongest performance being the prediction of depression scores based on standardized assessments (accuracy range = 0.80-0.84), which were statistically indistinguishable from human clinical raters, t(1,144) = 1.20; p = 0.23. Results show the potential for general clinical language models to flexibly predict psychiatric risk based on free descriptions of functioning from both patients and clinicians.

Research Motivation and Objectives

Determine whether an LLM trained on a large corpus of medical knowledge (Med-PaLM 2) can predict psychiatric symptom severity and diagnoses from interviews without task-specific training.
Evaluate the model's ability to estimate PHQ-8 (depression) and PCL-C (PTSD) scores and determine caseness.
Assess the model's capacity to label DSM-5 diagnostic categories from case studies and describe its reasoning.
Examine the explanations generated by the model to determine if they are diagnostically informative and clinically plausible.

Proposed Method

Use Med-PaLM 2 (L model) with prompts tailored to focus on PHQ-8 and PCL-C knowledge and to extract scores and confidence estimates.
Prompt structure directs attention to relevant rating scales, requests score estimates, confidence, and descriptive reasoning.
Compare model estimates to human raters using standard metrics (accuracy, sensitivity, specificity, MAE, RMSE, Cohen’s kappa, Pearson r).
Analyze model-generated textual explanations and frequency of DSM-5–mapped terms to assess diagnostically descriptive capability.
Utilize a case-study set from DSM-5 Clinical Cases to test broad diagnostic labeling without task-specific training.
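
The model-versus-human comparison listed above can be sketched in plain Python. This is only an illustration of how such metrics are conventionally computed, not the authors' evaluation code: the function name `agreement_metrics`, the toy scores, and the `cutoff` argument are all assumptions. A PHQ-8 total of 10 or more is a commonly used caseness cutoff, but this review does not state which thresholds the paper applied.

```python
import math


def agreement_metrics(model_scores, human_scores, cutoff):
    """Compare model-estimated scale totals against human-rated totals.

    A score >= cutoff is treated as a positive case (caseness).
    The cutoff is supplied by the caller; none is stated in this review.
    """
    if not model_scores or len(model_scores) != len(human_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(model_scores)
    pairs = list(zip(model_scores, human_scores))
    nan = float("nan")

    # Continuous error metrics on the raw totals.
    errors = [m - h for m, h in pairs]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # Pearson correlation between model and human totals.
    mean_m = sum(model_scores) / n
    mean_h = sum(human_scores) / n
    cov = sum((m - mean_m) * (h - mean_h) for m, h in pairs)
    ss_m = sum((m - mean_m) ** 2 for m in model_scores)
    ss_h = sum((h - mean_h) ** 2 for h in human_scores)
    pearson_r = cov / math.sqrt(ss_m * ss_h) if ss_m > 0 and ss_h > 0 else nan

    # Binary caseness: human rating is the reference, model is the prediction.
    tp = sum(1 for m, h in pairs if m >= cutoff and h >= cutoff)
    tn = sum(1 for m, h in pairs if m < cutoff and h < cutoff)
    fp = sum(1 for m, h in pairs if m >= cutoff and h < cutoff)
    fn = sum(1 for m, h in pairs if m < cutoff and h >= cutoff)
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn) if (tp + fn) else nan
    specificity = tn / (tn + fp) if (tn + fp) else nan

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e < 1 else nan

    return {
        "accuracy": accuracy, "sensitivity": sensitivity,
        "specificity": specificity, "mae": mae, "rmse": rmse,
        "kappa": kappa, "pearson_r": pearson_r,
    }


# Invented toy data for illustration only; not taken from the paper.
human = [2, 12, 15, 4, 9, 20]
model = [3, 10, 8, 5, 11, 18]
metrics = agreement_metrics(model, human, cutoff=10)
```

A pattern of high specificity with low sensitivity, as reported for PCL-C in this review, would show up here as few false positives (`fp`) but many false negatives (`fn`).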

Experimental Results

Research Questions

RQ1 Can Med-PaLM 2 predict PHQ-8 and PCL-C scores from clinical interviews without task-specific training?
RQ2 How does the model's performance compare to human raters on depression and PTSD assessments in terms of accuracy, error, and diagnostic accuracy?
RQ3 Is Med-PaLM 2 able to label DSM-5 diagnostic categories from case studies with high accuracy?
RQ4 Do the model's explanations contain content that aligns with diagnostic reasoning for MDD and PTSD?
RQ5 What are the limitations in identifying comorbidities or diagnostic modifiers using this approach?

Key Findings

For PHQ-8 (depression), accuracy is 0.80 and the model’s estimates are not statistically different from human raters (p = 0.23).
For PCL-C (PTSD), accuracy is 0.74 and the model shows high specificity (0.98) but lower sensitivity (0.30).
Model-human comparison shows Cohen’s kappa of 0.55 for PHQ-8 and 0.33 for PCL-C, indicating moderate agreement for depression and fair agreement for PTSD.
Depression predictions achieved an MAE of 2.33 and an RMSE of 3.93; PTSD predictions had an MAE of 9.07 and an RMSE of 11.2.
The model correctly labeled diagnostic categories 92.5% of the time and the specific diagnosis 77.5% of the time in case studies.
The model's explanations were more likely to contain DSM-5–relevant terminology when describing PHQ-8 and PCL-C results, suggesting that its summaries are explainable.
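
The "moderate" and "fair" labels attached to the kappa values above match the widely used Landis and Koch (1977) bands. The review does not say which convention was applied, so the sketch below is an assumption, and the helper name `kappa_label` is hypothetical.

```python
def kappa_label(kappa):
    """Map a Cohen's kappa value to its Landis and Koch (1977) descriptor."""
    if kappa < 0:
        return "poor"
    # Upper bound of each band, checked in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1.0")


# Kappa values reported in this review.
phq8_label = kappa_label(0.55)  # depression (PHQ-8)
pclc_label = kappa_label(0.33)  # PTSD (PCL-C)
```

Under these bands, 0.55 falls in the "moderate" range and 0.33 in the "fair" range, consistent with the wording in the findings.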

This review was generated by AI and reviewed by human editors.