QUICK REVIEW

[Paper Review] Evaluation of ChatGPT for NLP-based Mental Health Applications

Bishal Lamichhane|arXiv (Cornell University)|Mar 28, 2023

Mental Health via Writing | Cited by 57

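One-Sentence Summary

This paper evaluates zero-shot ChatGPT (gpt-3.5-turbo) on three mental health text classification tasks (stress, depression, and suicidality) using public social media datasets. It reports F1 scores of 0.73, 0.86, and 0.37, compared with 0.35, 0.60, and 0.19 for a baseline that always predicts the dominant class.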

ABSTRACT

Large language models (LLM) have been successful in several natural language understanding tasks and could be relevant for natural language processing (NLP)-based mental health application research. In this work, we report the performance of LLM-based ChatGPT (with gpt-3.5-turbo backend) in three text-based mental health classification tasks: stress detection (2-class classification), depression detection (2-class classification), and suicidality detection (5-class classification). We obtained annotated social media posts for the three classification tasks from public datasets. Then ChatGPT API classified the social media posts with an input prompt for classification. We obtained F1 scores of 0.73, 0.86, and 0.37 for stress detection, depression detection, and suicidality detection, respectively. A baseline model that always predicted the dominant class resulted in F1 scores of 0.35, 0.60, and 0.19. The zero-shot classification accuracy obtained with ChatGPT indicates a potential use of language models for mental health classification tasks.

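Motivation and Objectives

Evaluate the zero-shot classification performance of ChatGPT on NLP-based mental health tasks using public social media datasets.
Compare ChatGPT's output against a dominant-class baseline model to establish a performance benchmark.
Analyze confusion patterns and discuss the implications of using large language models as a backend for mental health applications.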

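Proposed Method

Use ChatGPT (gpt-3.5-turbo) through the OpenAI API, prompting it to assign a single class to each post.
Evaluate three tasks: stress detection (2 classes), depression detection (2 classes), and suicidality detection (5 classes).
Compute the F1 score (weighted in the multi-class setting) and balanced accuracy, and inspect the confusion matrix for each task.
Datasets: stress detection uses Reddit posts; depression detection uses Reddit posts and blogs; suicidality detection uses a labeled 5-class dataset.
Compare the results against a baseline model that always predicts the dominant class.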
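
The classification step listed above can be sketched roughly as follows. This is a minimal illustration and not the paper's code: the prompt wording, the `build_prompt`, `parse_label`, and `classify` helpers, and the `ask_model` callable (a stand-in for a request to the gpt-3.5-turbo chat API) are all assumptions, because this review does not reproduce the actual prompt.

```python
from typing import Callable, Optional, Sequence


def build_prompt(post: str, labels: Sequence[str]) -> str:
    """Build a zero-shot prompt asking for exactly one label (hypothetical wording)."""
    options = ", ".join(labels)
    return (
        "Classify the following social media post into exactly one of these "
        f"categories: {options}. Answer with the category name only.\n\n"
        f"Post: {post}"
    )


def parse_label(reply: str, labels: Sequence[str], fallback: str) -> str:
    """Map a free-text reply onto a known label; return the fallback if none matches."""
    text = reply.strip().lower()
    # Check longer labels first so "non-stress" is not mistaken for "stress".
    for label in sorted(labels, key=len, reverse=True):
        if label.lower() in text:
            return label
    return fallback


def classify(
    post: str,
    labels: Sequence[str],
    ask_model: Callable[[str], str],
    fallback: Optional[str] = None,
) -> str:
    """Send one prompt per post and reduce the reply to a single class."""
    reply = ask_model(build_prompt(post, labels))
    return parse_label(reply, labels, fallback if fallback is not None else labels[0])


# Stand-in for the real API call, used only so the sketch runs offline.
def fake_model(prompt: str) -> str:
    return "Non-stress."


labels = ["stress", "non-stress"]
prediction = classify("Had a calm weekend hiking with friends.", labels, fake_model)
print(prediction)
```

Passing the model call in as a function keeps the prompt and parsing logic testable without network access. In a real run, `ask_model` would wrap the API request, and the fallback policy for unparseable replies would need to be chosen and reported explicitly.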

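Experimental Results

Research Questions

RQ1: Can zero-shot ChatGPT reliably classify social media text as stress/non-stress, depression/non-depression, and into five suicidality-related classes?
RQ2: How does ChatGPT's zero-shot performance compare with a simple baseline predictor on these mental health tasks?
RQ3: What do the confusion matrices reveal about between-class confusion, particularly in the five-class suicidality setting?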
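
To make RQ3 concrete, the following sketch shows how a five-class confusion matrix can be tabulated from true and predicted labels. The class names `A` to `E` and the label lists are placeholders invented for illustration. They are not the dataset's categories or results from the paper.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Return counts with rows = true class and columns = predicted class."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


# Toy labels and predictions (hypothetical, for illustration only).
labels = ["A", "B", "C", "D", "E"]
y_true = ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E"]
y_pred = ["A", "B", "B", "B", "B", "C", "C", "D", "D", "E"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)

# Off-diagonal cells are the misclassifications; their sum is the error count.
misclassified = sum(
    matrix[i][j]
    for i in range(len(labels))
    for j in range(len(labels))
    if i != j
)
print("misclassified:", misclassified)
```

Reading along a row shows where posts of one true class end up. Large off-diagonal cells between two classes would indicate the kind of between-class confusion that RQ3 asks about.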

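Key Findings

Task	F1 score	Balanced accuracy
Stress Detection	0.73	0.73
Depression Detection	0.86	0.85
Suicidality Detection	0.37	0.33

Stress detection reaches F1 = 0.73 (baseline 0.35).
Depression detection reaches F1 = 0.86 (baseline 0.60).
Suicidality detection reaches F1 = 0.37 (baseline 0.19).
Balanced accuracy: stress 0.73, depression 0.85, suicidality 0.33.
Zero-shot ChatGPT exceeds the dominant-class baseline F1 on all three tasks; fine-tuning or prompt variations may improve performance further.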
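
As a rough illustration of how the reported metrics and the dominant-class baseline relate, the sketch below computes a support-weighted F1 and balanced accuracy in plain Python. The function names and the toy labels are assumptions made for this example. The sketch follows the usual definitions of these metrics and is not taken from the paper's evaluation code.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, label):
    """F1 for one class, computed as 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of true labels."""
    support = Counter(y_true)
    n = len(y_true)
    return sum(per_class_f1(y_true, y_pred, c) * k / n for c, k in support.items())


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally regardless of size."""
    support = Counter(y_true)
    recalls = [
        sum(t == c and p == c for t, p in zip(y_true, y_pred)) / k
        for c, k in support.items()
    ]
    return sum(recalls) / len(recalls)


def dominant_class_baseline(y_true):
    """Predict the most frequent true class for every sample."""
    majority = Counter(y_true).most_common(1)[0][0]
    return [majority] * len(y_true)


# Toy data (hypothetical): 6 positive and 4 negative posts.
y_true = ["pos"] * 6 + ["neg"] * 4
baseline_pred = dominant_class_baseline(y_true)

baseline_f1 = weighted_f1(y_true, baseline_pred)
baseline_bal_acc = balanced_accuracy(y_true, baseline_pred)
print(round(baseline_f1, 2), baseline_bal_acc)
```

A predictor that always outputs one class has a recall of 1 on that class and 0 on every other class, so its balanced accuracy is 1/K for K classes: 0.5 in the two-class tasks and 0.2 in the five-class task. This is useful context for reading the balanced accuracy values reported above.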

This review was generated by AI and checked by a human editor.