QUICK REVIEW

[Paper Review] Towards Interpretable Mental Health Analysis with Large Language Models

Kailai Yang, Shaoxiong Ji|arXiv (Cornell University)|Apr 6, 2023

Mental Health via Writing|Cited by 27

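One-Sentence Summary

This paper evaluates several large language models on mental health analysis across 11 datasets and 5 tasks, explores prompting strategies that use emotional cues, and studies whether the models can generate explanations for their decisions.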

ABSTRACT

The latest large language models (LLMs) such as ChatGPT, exhibit strong capabilities in automated mental health analysis. However, existing relevant studies bear several limitations, including inadequate evaluations, lack of prompting strategies, and ignorance of exploring LLMs for explainability. To bridge these gaps, we comprehensively evaluate the mental health analysis and emotional reasoning ability of LLMs on 11 datasets across 5 tasks. We explore the effects of different prompting strategies with unsupervised and distantly supervised emotional information. Based on these prompts, we explore LLMs for interpretable mental health analysis by instructing them to generate explanations for each of their decisions. We convey strict human evaluations to assess the quality of the generated explanations, leading to a novel dataset with 163 human-assessed explanations. We benchmark existing automatic evaluation metrics on this dataset to guide future related works. According to the results, ChatGPT shows strong in-context learning ability but still has a significant gap with advanced task-specific methods. Careful prompt engineering with emotional cues and expert-written few-shot examples can also effectively improve performance on mental health analysis. In addition, ChatGPT generates explanations that approach human performance, showing its great potential in explainable mental health analysis.

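Motivation and Objectives

Evaluate the general mental health analysis and emotional reasoning abilities of LLMs in zero-shot and few-shot settings.
Investigate how prompting strategies and emotional cues affect performance on mental health tasks.
Explore the ability of LLMs to generate explanations for their own mental health analysis decisions, and establish a dataset of human-evaluated explanations.
Provide a benchmark of, and insights into, automatic metrics for evaluating explainable mental health analysis.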

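Proposed Method

Evaluate four representative LLMs (LLaMA-7B, LLaMA-13B, InstructGPT-3, ChatGPT), alongside MentalBERT/RoBERTa baselines, on 11 datasets across 5 tasks.
Test prompting strategies: zero-shot prompts, emotion-enhanced CoT prompts, and distantly supervised emotion prompts, plus few-shot prompts with expert-written examples.
Have two LLMs (ChatGPT and InstructGPT-3) generate natural-language explanations for their decisions, and subject those explanations to strict human evaluation.
Build a dataset of 163 human-evaluated explanations and benchmark automatic evaluation metrics against the human judgments.
Analyze the limitations of LLMs in mental health analysis and explainability, discussing unstable predictions and inaccurate reasoning.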
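
The prompting strategies listed above can be sketched as a small prompt builder. This is a minimal illustration only: the function name `build_prompt`, its parameters, and all prompt wording are hypothetical and are not taken from the paper. In particular, `emotion_hint` assumes that a distantly supervised emotion label is supplied by some external source, which is one plausible reading of the strategy and not a confirmed detail.

```python
from typing import Optional, Sequence, Tuple


def build_prompt(
    post: str,
    condition: str = "depression",
    emotion_cot: bool = False,
    emotion_hint: Optional[str] = None,
    few_shot: Optional[Sequence[Tuple[str, str]]] = None,
) -> str:
    """Assemble an illustrative prompt (hypothetical wording, not the paper's)."""
    parts = []

    # Few-shot: prepend expert-written (post, response) pairs as demonstrations.
    for example_post, example_response in few_shot or ():
        parts.append(f"Post: {example_post}\nResponse: {example_response}")

    base_question = f"Is the poster likely to suffer from {condition}?"

    if emotion_cot:
        # Emotion-enhanced CoT: ask the model to reason about emotions first.
        question = (
            "Consider the emotions expressed in this post, then answer: "
            f"{base_question} Answer yes or no and explain step by step."
        )
    else:
        # Plain zero-shot question.
        question = f"{base_question} Answer yes or no and explain your reasoning."

    if emotion_hint is not None:
        # Assumption: an externally derived emotion label is passed in as a hint.
        question = f"The post appears to express {emotion_hint} emotion. {question}"

    parts.append(f"Post: {post}\nQuestion: {question}\nResponse:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("I have not slept properly in weeks.", emotion_cot=True))
```

Keeping each strategy behind its own argument makes it easy to compare variants on the same post, which mirrors the kind of comparison the review describes.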

Figure 1: The pipeline of obtaining and evaluating the LLM-generated explanations for mental health analysis. In LLM responses, red, green, and blue words are marked as relevant clues for rating fluency, reliability, and completeness in human evaluations.

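Experimental Results

Research Questions

RQ1: How well do LLMs perform at generalized mental health analysis and emotional reasoning in zero-shot and few-shot settings?
RQ2: How do different prompting strategies and emotional cues affect ChatGPT's mental health analysis ability?
RQ3: How well can ChatGPT generate explanations for its own decisions?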

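Key Findings

ChatGPT generally achieves the best performance among the LLMs examined, but still falls short of state-of-the-art supervised methods.
Emotion-enhanced prompting (especially unsupervised emotion prompts with CoT) and expert-written few-shot examples substantially improve performance, approaching the state of the art on some tasks.
ChatGPT generates explanations that approach human level in fluency, reliability, and completeness, indicating strong potential for explainable mental health analysis.
Existing automatic evaluation metrics correlate only moderately with human evaluation of explanations; explainable mental health analysis needs task-specific metrics.
ChatGPT's predictions can be unstable and its reasoning is sometimes inaccurate, and its output is sensitive to prompt wording; few-shot prompts help mitigate the instability.
This work contributes a new annotated corpus of 163 human-evaluated explanations, demonstrates the usefulness of emotional cues in prompting, and benchmarks several automatic metrics for explainability.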
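
Benchmarking an automatic metric against human judgments, as described in the findings above, usually comes down to a correlation between the two sets of scores. The sketch below uses Pearson correlation as one common choice; the review does not say which coefficient the authors report, and the numbers are toy values for illustration, not results from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy numbers for illustration only; they are not results from the paper.
human_ratings = [3.0, 2.0, 1.0, 3.0, 0.0]       # e.g. one human-rated aspect
metric_scores = [0.82, 0.61, 0.40, 0.75, 0.22]  # e.g. one automatic metric

r = pearson(human_ratings, metric_scores)
print(f"Pearson r = {r:.3f}")
```

A coefficient near 1 would mean the metric ranks explanations much as the human raters do; a moderate value is what motivates the call for task-specific metrics.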

Figure 2: Box plots of the aggregated human evaluation scores for each aspect. Orange lines denote the median scores and green lines denote the average scores.


This review was generated by AI and checked by a human editor.