QUICK REVIEW

[Paper Review] Are Large Language Models More Empathetic than Humans?

Anuradha Welivita, Pearl Pu|arXiv (Cornell University)|Jun 7, 2024

Topic Modeling | Citations: 5

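One-Line Summary

This paper runs a between-subjects study comparing the empathetic responses of four state-of-the-art LLMs (GPT-4, LLaMA-2-70B-Chat, Gemini-1.0-Pro, Mixtral-8x7B-Instruct) against a human baseline. With 1,000 participants rating responses to 2,000 prompts, it finds that LLMs generally receive better empathy ratings than humans.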

ABSTRACT

With the emergence of large language models (LLMs), investigating if they can surpass humans in areas such as emotion recognition and empathetic responding has become a focal point of research. This paper presents a comprehensive study exploring the empathetic responding capabilities of four state-of-the-art LLMs: GPT-4, LLaMA-2-70B-Chat, Gemini-1.0-Pro, and Mixtral-8x7B-Instruct in comparison to a human baseline. We engaged 1,000 participants in a between-subjects user study, assessing the empathetic quality of responses generated by humans and the four LLMs to 2,000 emotional dialogue prompts meticulously selected to cover a broad spectrum of 32 distinct positive and negative emotions. Our findings reveal a statistically significant superiority of the empathetic responding capability of LLMs over humans. GPT-4 emerged as the most empathetic, marking approximately 31% increase in responses rated as "Good" compared to the human benchmark. It was followed by LLaMA-2, Mixtral-8x7B, and Gemini-Pro, which showed increases of approximately 24%, 21%, and 10% in "Good" ratings, respectively. We further analyzed the response ratings at a finer granularity and discovered that some LLMs are significantly better at responding to specific emotions compared to others. The suggested evaluation framework offers a scalable and adaptable approach for assessing the empathy of new LLMs, avoiding the need to replicate this study's findings in future research.

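Motivation and Objectives

Investigate and quantify whether LLMs can respond more empathetically than humans across a broad range of emotions.
Develop a scalable and adaptable evaluation framework for assessing LLM empathy without having to re-run earlier studies.
Build a human baseline from EmpatheticDialogues prompts and compare it against multiple contemporary LLMs.
Release the prompts, responses, and ratings publicly to support reproducible future benchmarks.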

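Proposed Method

A between-subjects design with five groups: human, GPT-4, LLaMA-2-70B-Chat, Gemini-1.0-Pro, and Mixtral-8x7B-Instruct.
The evaluation corpus consists of 2,000 dialogue prompts spanning 32 emotions, drawn from the EmpatheticDialogues dataset.
The LLMs are prompted with an instruction that defines empathy, covering cognitive, affective, and compassionate empathy.
Ratings are collected from 1,000 participants (200 per group) on a three-point Bad/Okay/Good scale.
Statistical analysis uses the chi-square test of independence to compare the proportions of Good/Okay/Bad ratings across groups.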
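
The statistical comparison described above can be sketched as a Pearson chi-square test of independence on a groups-by-ratings contingency table. What follows is a minimal, standard-library-only sketch, not the paper's code: the function name `chi_square_independence` and all counts are hypothetical and chosen only for illustration. The closed-form p-value used here holds only for even degrees of freedom, which is always the case for a three-column Bad/Okay/Good table.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square test of independence for a contingency table.

    `table` is a list of rows (one per group), each a list of counts
    (one per rating category). Returns (statistic, dof, p_value).
    The p-value uses the closed-form chi-square survival function,
    which is valid only for even degrees of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of group and rating.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    if dof < 2 or dof % 2 != 0:
        raise ValueError("closed-form p-value needs even, positive dof")

    half = stat / 2.0
    # Survival function of chi-square with even dof:
    # exp(-x/2) * sum_{k < dof/2} (x/2)^k / k!
    p_value = math.exp(-half) * sum(
        half ** k / math.factorial(k) for k in range(dof // 2)
    )
    return stat, dof, p_value


# Hypothetical (Bad, Okay, Good) counts for two groups; NOT the paper's data.
example = [
    [20, 30, 50],  # hypothetical group A
    [50, 30, 20],  # hypothetical group B
]
stat, dof, p = chi_square_independence(example)
print(f"chi2={stat:.3f}, dof={dof}, p={p:.2e}")
```

In practice a library routine such as `scipy.stats.chi2_contingency` would normally be preferred; the hand-rolled version is shown only to make the computation explicit.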

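Experimental Results

Research Questions

RQ1: Do LLMs produce higher-quality empathetic responses than humans across a broad range of emotions?
RQ2: Does empathetic performance vary among LLMs for different emotions?
RQ3: Compared with the conventional within-subjects design, does a between-subjects design provide a robust and scalable way to evaluate evolving LLMs?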

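Key Findings

GPT-4 received approximately 31% more "Good" ratings than the human baseline, the largest improvement of any model (statistically significant).
LLaMA-2, Mixtral-8x7B, and Gemini-Pro showed gains in "Good" ratings over humans of approximately 24%, 21%, and 10%, respectively.
All four LLMs outperformed humans in "Good" ratings for both positive and negative emotions, with GPT-4 leading in many categories.
Significant differences were observed at the level of individual emotions, with some LLMs excelling on specific ones (e.g., GPT-4 on Impressed, Surprised, Grateful, and Proud).
For positive emotions, GPT-4, LLaMA-2, and Mixtral-8x7B generally showed large gains. Gemini-Pro's gains on positive emotions were less clear, but it performed better on several negative emotions.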
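
A gain in "Good" ratings over a baseline can be computed in two common ways: as a difference in percentage points or as a relative change. This summary does not state which one the reported figures use, so the sketch below computes both. The helper names `good_share` and `compare_to_baseline` and the rating lists are hypothetical, not taken from the paper.

```python
from collections import Counter


def good_share(ratings):
    """Fraction of labels equal to 'Good' in a list of Bad/Okay/Good labels."""
    counts = Counter(ratings)
    return counts["Good"] / len(ratings)


def compare_to_baseline(baseline, candidate):
    """Compare the 'Good' share of a candidate group against a baseline group."""
    base = good_share(baseline)
    cand = good_share(candidate)
    return {
        # Absolute difference, in percentage points.
        "point_diff": (cand - base) * 100,
        # Change relative to the baseline share, in percent.
        "relative_pct": (cand - base) / base * 100,
    }


# Hypothetical rating labels; NOT the paper's data.
human = ["Good"] * 4 + ["Okay"] * 4 + ["Bad"] * 2
model = ["Good"] * 6 + ["Okay"] * 3 + ["Bad"] * 1

result = compare_to_baseline(human, model)
print(result)
```

With these made-up labels the two measures differ substantially (about 20 percentage points versus about 50% relative), which is why it matters to know which one a reported gain refers to.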

This review was generated by AI and checked by a human editor.