QUICK REVIEW

[Paper Review] Evaluation of ChatGPT for NLP-based Mental Health Applications

Bishal Lamichhane|arXiv (Cornell University)|2023. 03. 28.

Mental Health via Writing | Citations: 57

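One-line summary

This paper evaluates zero-shot ChatGPT (GPT-3.5-turbo) on three mental health text classification tasks (stress, depression, and suicidality) using public social media datasets. It reports F1 scores of 0.73, 0.86, and 0.37, respectively, and compares them against a simple baseline.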

ABSTRACT

Large language models (LLM) have been successful in several natural language understanding tasks and could be relevant for natural language processing (NLP)-based mental health application research. In this work, we report the performance of LLM-based ChatGPT (with gpt-3.5-turbo backend) in three text-based mental health classification tasks: stress detection (2-class classification), depression detection (2-class classification), and suicidality detection (5-class classification). We obtained annotated social media posts for the three classification tasks from public datasets. Then ChatGPT API classified the social media posts with an input prompt for classification. We obtained F1 scores of 0.73, 0.86, and 0.37 for stress detection, depression detection, and suicidality detection, respectively. A baseline model that always predicted the dominant class resulted in F1 scores of 0.35, 0.60, and 0.19. The zero-shot classification accuracy obtained with ChatGPT indicates a potential use of language models for mental health classification tasks.

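Motivation and objectives

Evaluate the zero-shot classification performance of ChatGPT on NLP-based mental health tasks using public social media datasets.
Establish a performance benchmark by comparing ChatGPT's output against a dominant-class baseline model.
Analyze error patterns and discuss the implications of using LLMs as a backend for mental health applications.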

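Proposed method

Use ChatGPT (GPT-3.5-turbo) through the OpenAI API, with a prompt that asks for a single class for each post.
Evaluate three tasks: stress detection (2-class), depression detection (2-class), and suicidality detection (5-class).
Compute the F1 score (weighted average for the multi-class task) and balanced accuracy, and analyze the confusion matrix for each task.
Dataset sources: a stress detection dataset built from Reddit posts; a depression detection dataset from Reddit and blogs; a labeled 5-class dataset for suicidality detection.
Compare the results against a baseline model that always predicts the dominant class.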
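
The per-post prompting step above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the label strings in `STRESS_LABELS`, and the helper names `build_prompt`, `parse_label`, and `classify` are all assumptions, since this review does not reproduce the paper's prompt. The model call is passed in as a plain callable (`ask_model`, hypothetical) so the sketch does not depend on any particular API client.

```python
from typing import Callable, Sequence

# Assumed label wording; the paper's exact class names are not given here.
STRESS_LABELS = ("stress", "no stress")


def build_prompt(post: str, labels: Sequence[str]) -> str:
    """Build a zero-shot prompt that asks for exactly one class name."""
    options = ", ".join(labels)
    return (
        "Classify the following social media post into exactly one of "
        f"these classes: {options}.\n"
        "Answer with the class name only.\n\n"
        f"Post: {post}"
    )


def parse_label(reply: str, labels: Sequence[str], fallback: str) -> str:
    """Map a free-text model reply onto one of the allowed labels."""
    text = reply.strip().lower()
    # Check longer labels first so "no stress" is not read as "stress".
    for label in sorted(labels, key=len, reverse=True):
        if label.lower() in text:
            return label
    # Nothing matched: return the caller-supplied fallback label.
    return fallback


def classify(
    post: str,
    labels: Sequence[str],
    ask_model: Callable[[str], str],
    fallback: str,
) -> str:
    """Send one prompt for one post and return a single class label."""
    reply = ask_model(build_prompt(post, labels))
    return parse_label(reply, labels, fallback)
```

In practice `ask_model` would wrap a chat-completion request to the model; here it can be any function from a prompt string to a reply string, which also makes the parsing logic easy to test with a stub. Substring matching is a deliberately simple choice and will misread replies such as "there is no sign of stress", so a stricter output format may be preferable.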

Experimental results

Research questions

RQ1: Can zero-shot ChatGPT reliably classify social-media text into stress/non-stress, depression/non-depression, and five suicidality-related classes?
RQ2: How does ChatGPT’s zero-shot performance compare to simple baseline predictors on these mental health tasks?
RQ3: What do confusion matrices reveal about inter-class confusion, especially in the suicidality five-class setting?

Key results

Dataset	F1 score	Balanced Accuracy
Stress Detection	0.73	0.73
Depression Detection	0.86	0.85
Suicidality Detection	0.37	0.33

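Stress detection: F1 = 0.73 (baseline 0.35).
Depression detection: F1 = 0.86 (baseline 0.60).
Suicidality detection: F1 = 0.37 (baseline 0.19).
Balanced accuracy: stress 0.73, depression 0.85, suicidality 0.33.
Zero-shot ChatGPT shows promising performance relative to the baseline, with room for further improvement through fine-tuning or prompt variations.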
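
To make the reported metrics concrete, the sketch below computes a support-weighted F1 score, balanced accuracy, confusion counts, and a dominant-class baseline in plain Python. It is an illustration under assumptions: the function names are hypothetical, the toy labels are made up and are not the paper's data, and weighted averaging is applied to every task here even though the review only states it for the multi-class case.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    support = Counter(y_true)
    recalls = []
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(tp / n)
    return sum(recalls) / len(recalls)


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))


def dominant_class_baseline(y_true):
    """Predict the most frequent true label for every example."""
    dominant = Counter(y_true).most_common(1)[0][0]
    return [dominant] * len(y_true)


# Toy labels for illustration only (not from the paper).
toy_true = ["a"] * 6 + ["b"] * 4
toy_pred = dominant_class_baseline(toy_true)
baseline_f1 = weighted_f1(toy_true, toy_pred)
baseline_bacc = balanced_accuracy(toy_true, toy_pred)
```

On the toy labels, the always-dominant predictor gets a weighted F1 of 0.45 and a balanced accuracy of 0.5: the majority class scores F1 = 0.75 and the minority class scores 0, which shows why such a baseline can sit well below its raw accuracy of 0.6.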


This review was generated by AI and reviewed by a human editor.