QUICK REVIEW

[Paper Review] Is ChatGPT More Empathetic than Humans?

Anuradha Welivita, Pearl Pu|arXiv (Cornell University)|2024. 02. 22.

Artificial Intelligence in Healthcare and Education | Citations: 14

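One-Line Summary

Using a between-subjects design with 600 participants, the study compares GPT-4-generated empathetic responses with human responses and finds that GPT-4 tends to be rated as more empathetic, especially when the prompt includes a definition of empathy.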

ABSTRACT

This paper investigates the empathetic responding capabilities of ChatGPT, particularly its latest iteration, GPT-4, in comparison to human-generated responses to a wide range of emotional scenarios, both positive and negative. We employ a rigorous evaluation methodology, involving a between-groups study with 600 participants, to evaluate the level of empathy in responses generated by humans and ChatGPT. ChatGPT is prompted in two distinct ways: a standard approach and one explicitly detailing empathy's cognitive, affective, and compassionate counterparts. Our findings indicate that the average empathy rating of responses generated by ChatGPT exceeds those crafted by humans by approximately 10%. Additionally, instructing ChatGPT to incorporate a clear understanding of empathy in its responses makes the responses align approximately 5 times more closely with the expectations of individuals possessing a high degree of empathy, compared to human responses. The proposed evaluation framework serves as a scalable and adaptable framework to assess the empathetic capabilities of newer and updated versions of large language models, eliminating the need to replicate the current study's results in future research.

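Research Motivation and Objectives

Evaluate how empathetic GPT-4 responses are compared with human responses in chit-chat style conversations.
Evaluate two GPT-4 prompting strategies: vanilla (generic) and empathy-defined (cognitive, affective, and compassionate components).
Validate a scalable evaluation framework suited to future LLM empathy assessments, and derive results that generalize beyond a single model version.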

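Proposed Method

Use 2,000 dialogues from the EmpatheticDialogues dataset, distributed across 32 emotions.
Conduct a between-groups study in which 600 crowd workers rate responses from humans, GPT-4 (vanilla), and GPT-4 (empathy-defined).
Prompt GPT-4 to produce the first reply in each dialogue under two instruction styles: vanilla and empathy-defined.
Rate empathy on a three-point scale (Bad, Okay, Good) and analyze the ratings with one-way analysis of variance (ANOVA) and t-tests.
Measure each rater's empathic disposition with the Toronto Empathy Questionnaire (TEQ) and analyze how it interacts with the ratings.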
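The analysis step above (three-point ratings compared across three response sources with one-way ANOVA and t-tests) can be sketched roughly as follows. This is a minimal illustration, not the authors' analysis code: the numeric encoding of the scale, the choice of Welch's variant of the t-test, and every rating in `toy` are assumptions made up for the example.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical numeric encoding of the three-point scale (not given in the source).
SCALE = {"Bad": 0, "Okay": 1, "Good": 2}


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual ratings around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def welch_t(a, b):
    """Welch's t statistic for two independent samples (an assumed variant)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Made-up toy ratings for illustration only; these are NOT the paper's data.
toy = {
    "human": ["Bad", "Okay", "Okay", "Good", "Okay", "Bad"],
    "gpt4_vanilla": ["Okay", "Good", "Good", "Okay", "Good", "Okay"],
    "gpt4_empathy_defined": ["Good", "Good", "Okay", "Good", "Good", "Okay"],
}
scores = {name: [SCALE[r] for r in ratings] for name, ratings in toy.items()}

f_stat, df_b, df_w = one_way_anova_f(list(scores.values()))
t_stat = welch_t(scores["gpt4_vanilla"], scores["human"])
print(f"F({df_b}, {df_w}) = {f_stat:.3f}, Welch t (vanilla vs human) = {t_stat:.3f}")
```

The sketch stops at the test statistics because the Python standard library has no F or t distribution; turning them into p-values would need a statistics package.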

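Experimental Results

Research Questions

RQ1: Does GPT-4 generate more empathetic responses than humans across diverse emotional scenarios?
RQ2: Does explicitly defining empathy in the prompt help align GPT-4 with high-empathy raters?
RQ3: How do empathy ratings differ between positive and negative emotional contexts?
RQ4: Is there a relationship between a rater's own empathy (TEQ) and their ratings of GPT-4 versus human responses?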

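Key Findings

GPT-4 (vanilla) and GPT-4 (empathy-defined) both receive higher average empathy ratings than humans across all emotions.
GPT-4 (empathy-defined) yields the highest average ratings for all emotions and for negative emotions, about 11.21% and 9.61% higher than humans, respectively.
For positive emotions, GPT-4 (vanilla) has an average empathy rating 13.14% higher than humans.
The overall difference between GPT-4 (empathy-defined) and GPT-4 (vanilla) is not statistically significant (p > 0.05).
Raters with a higher empathic disposition tend to rate GPT-4 (empathy-defined) more highly, with a steeper slope than for humans or GPT-4 (vanilla).
Qualitative examples indicate that GPT-4, when prompted with the empathy-definition instructions, can communicate in a way that is non-directive yet more empathetic.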
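Two quantities in these findings lend themselves to a short worked sketch: the percentage by which a model's mean rating exceeds the human mean, and the slope of ratings against rater TEQ score. The code below assumes the percentages are computed relative to the human mean and that the slope is an ordinary least-squares fit; neither detail is stated in this review, and all numbers are invented for illustration.

```python
from statistics import mean


def percent_increase(model_mean, human_mean):
    """Relative gain of the model's mean rating over the human mean, in percent."""
    return 100.0 * (model_mean - human_mean) / human_mean


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs (for example, rating on TEQ score)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Made-up values for illustration only; these are NOT the paper's data.
gain = percent_increase(model_mean=1.1, human_mean=1.0)

toy_teq = [30, 40, 50, 60]          # hypothetical rater TEQ scores
toy_rating = [1.0, 1.2, 1.4, 1.6]   # hypothetical mean rating given by each rater
slope = ols_slope(toy_teq, toy_rating)

print(f"gain = {gain:.2f}%, slope = {slope:.4f} rating points per TEQ point")
```

Fitting one such slope per response source (human, vanilla, empathy-defined) and comparing them is one way to read the "steeper slope" finding.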


This review was generated by AI and checked by a human editor.