QUICK REVIEW

[Paper Review] Towards Interpretable Mental Health Analysis with Large Language Models

Kailai Yang, Shaoxiong Ji|arXiv (Cornell University)|Apr 6, 2023
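Mental Health via Writing|Cited by 27
One-Sentence Summary

This paper comprehensively evaluates the mental health analysis capabilities of several large language models (LLMs) on 11 datasets across 5 tasks, explores prompting strategies that use emotional cues, and examines the potential of LLMs to generate explanations for their decisions.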

ABSTRACT

The latest large language models (LLMs) such as ChatGPT, exhibit strong capabilities in automated mental health analysis. However, existing relevant studies bear several limitations, including inadequate evaluations, lack of prompting strategies, and ignorance of exploring LLMs for explainability. To bridge these gaps, we comprehensively evaluate the mental health analysis and emotional reasoning ability of LLMs on 11 datasets across 5 tasks. We explore the effects of different prompting strategies with unsupervised and distantly supervised emotional information. Based on these prompts, we explore LLMs for interpretable mental health analysis by instructing them to generate explanations for each of their decisions. We convey strict human evaluations to assess the quality of the generated explanations, leading to a novel dataset with 163 human-assessed explanations. We benchmark existing automatic evaluation metrics on this dataset to guide future related works. According to the results, ChatGPT shows strong in-context learning ability but still has a significant gap with advanced task-specific methods. Careful prompt engineering with emotional cues and expert-written few-shot examples can also effectively improve performance on mental health analysis. In addition, ChatGPT generates explanations that approach human performance, showing its great potential in explainable mental health analysis.

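Research Motivation and Objectives

  • Evaluate the general mental health analysis and emotional reasoning ability of LLMs in zero-shot and few-shot settings.
  • Investigate how prompting strategies and emotional cues affect LLM performance on mental health tasks.
  • Explore the ability of LLMs to generate explanations for their mental health analysis decisions, and build a dataset of human-evaluated explanations.
  • Provide a benchmark and insights into automatic evaluation metrics that can be used to assess interpretable mental health analysis.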

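Proposed Method

  • Evaluate four representative LLMs (LLaMA-7B, LLaMA-13B, InstructGPT-3, ChatGPT) on 11 datasets across 5 tasks, with MentalBERT/RoBERTa as baselines.
  • Test prompting strategies: zero-shot, emotion-enhanced chain-of-thought (CoT), and distantly supervised emotion prompts, together with expert-written few-shot examples.
  • Instruct two LLMs (ChatGPT and InstructGPT-3) to generate natural-language explanations for their decisions, and assess these explanations through strict human evaluation.
  • Create a dataset of 163 human-evaluated explanations, and use the human judgments to benchmark automatic evaluation metrics.
  • Analyze the limitations of LLMs in mental health analysis and explainability, discussing instability and inaccurate reasoning.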
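
The prompting strategies listed above can be sketched as a small prompt builder. This is a minimal illustration only: the function name `build_prompt`, the strategy labels, and all template wording are assumptions made for this example and do not reproduce the prompts used in the paper. It shows how a zero-shot question, an emotion-enhanced CoT instruction, and optional few-shot examples might be assembled into one prompt string.

```python
from typing import Sequence, Tuple

# Hypothetical question template; the paper's exact wording is not reproduced here.
QUESTION = "Is the poster likely to suffer from {condition}?"


def build_prompt(
    post: str,
    condition: str,
    strategy: str = "zero_shot",
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a prompt for one post under an (assumed) prompting strategy.

    `examples` holds optional few-shot (post, response) pairs, such as
    expert-written demonstrations.
    """
    question = QUESTION.format(condition=condition)
    if strategy == "zero_shot":
        instruction = f"{question} Answer Yes or No."
    elif strategy == "emotion_cot":
        # Emotion-enhanced chain-of-thought: ask the model to reason about
        # the emotions in the post before giving its decision.
        instruction = (
            f"{question} First consider the emotions expressed in the post, "
            "then answer Yes or No and explain your reasoning step by step."
        )
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    parts = [instruction]
    # Few-shot demonstrations come before the target post.
    for example_post, example_response in examples:
        parts.append(f"Post: {example_post}\nResponse: {example_response}")
    parts.append(f"Post: {post}\nResponse:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = [("I can't sleep and nothing feels worth doing.", "Yes. The post expresses hopelessness.")]
    print(build_prompt("Work was fine today.", "depression", "emotion_cot", demo))
```

Keeping the strategy as a single argument makes it easy to run the same post through each variant and compare the model's answers, which is the kind of comparison RQ2 asks about.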
Figure 1: The pipeline of obtaining and evaluating the LLM-generated explanations for mental health analysis. In LLM responses, red, green, and blue words are marked as relevant clues for rating fluency, reliability, and completeness in human evaluations.

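Experimental Results

Research Questions

  • RQ1: How well do LLMs perform on general mental health analysis and emotional reasoning in zero-shot and few-shot settings?
  • RQ2: How do different prompting strategies and emotional cues affect ChatGPT's mental health analysis ability?
  • RQ3: How well can ChatGPT generate explanations for its mental health analysis decisions?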

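Key Findings

  • ChatGPT generally achieves the best performance among the LLMs examined, but still lags behind more advanced supervised methods.
  • Emotion-enhanced prompts (especially unsupervised emotion prompts with CoT), combined with expert-written few-shot examples, significantly improve performance, approaching the state of the art on some tasks.
  • ChatGPT can generate explanations whose fluency, reliability, and completeness approach human-level quality, indicating strong potential for explainable mental health analysis.
  • Current automatic evaluation metrics correlate only moderately with human judgments of explanations; task-specific metrics are needed for explainable mental health analysis.
  • ChatGPT shows unstable predictions and sometimes inaccurate reasoning, and is sensitive to prompt wording; few-shot prompting helps mitigate part of this instability.
  • The study provides a new annotated dataset of 163 human-evaluated explanations, demonstrates the effectiveness of emotional cues in prompts, and benchmarks multiple automatic evaluation metrics for explainability.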
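
Benchmarking an automatic metric against human judgments usually comes down to a correlation between two score lists. The sketch below shows one plausible way to compute Pearson and Spearman correlations in plain Python. The helper names and the toy scores are illustrative assumptions, not data or code from the paper, and the summary does not state which correlation measures the authors used.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes equal lengths and non-constant inputs."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Toy values for illustration only: human ratings of five explanations
    # and the scores a hypothetical automatic metric gave the same five.
    human_scores = [3, 1, 2, 3, 0]
    metric_scores = [0.8, 0.3, 0.5, 0.6, 0.1]
    print("Pearson:", round(pearson(human_scores, metric_scores), 3))
    print("Spearman:", round(spearman(human_scores, metric_scores), 3))
```

Spearman is often preferred when human ratings are ordinal, since it depends only on the ordering of scores rather than their spacing.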
Figure 2: Box plots of the aggregated human evaluation scores for each aspect. Orange lines denote the median scores and green lines denote the average scores.


This review was generated by AI and reviewed by human editors.