QUICK REVIEW

[Paper Review] A Linguistic Comparison between Human and ChatGPT-Generated Conversations

Morgan Sandler, Hyesun Choung|arXiv (Cornell University)|2024. 01. 29.

Artificial Intelligence in Healthcare and Education | Citations: 10

One-Line Summary

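This paper uses LIWC to analyze linguistic differences between human and ChatGPT-3.5 conversations across 118 categories, comparing 19.5K ChatGPT dialogues with EmpathicDialogues, and finds that human dialogues are more authentic while ChatGPT dialogues score higher on social, cognitive, and positive-tone features.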

ABSTRACT

This study explores linguistic differences between human and LLM-generated dialogues, using 19.5K dialogues generated by ChatGPT-3.5 as a companion to the EmpathicDialogues dataset. The research employs Linguistic Inquiry and Word Count (LIWC) analysis, comparing ChatGPT-generated conversations with human conversations across 118 linguistic categories. Results show greater variability and authenticity in human dialogues, but ChatGPT excels in categories such as social processes, analytical style, cognition, attentional focus, and positive emotional tone, reinforcing recent findings of LLMs being "more human than human." However, no significant difference was found in positive or negative affect between ChatGPT and human dialogues. Classifier analysis of dialogue embeddings indicates implicit coding of the valence of affect despite no explicit mention of affect in the conversations. The research also contributes a novel, companion ChatGPT-generated dataset of conversations between two independent chatbots, which were designed to replicate a corpus of human conversations available for open access and used widely in AI research on language modeling. Our findings enhance understanding of ChatGPT's linguistic capabilities and inform ongoing efforts to distinguish between human and LLM-generated text, which is critical in detecting AI-generated fakes, misinformation, and disinformation.

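Motivation and Objectives

Understand how human and LLM-generated dialogues differ, motivated by concerns about the authenticity of AI-generated conversation and the detection of AI-generated text.
Use LIWC to profile linguistic features and compare variability and authenticity between human and ChatGPT dialogues.
Provide a new ChatGPT-generated companion dataset (2GPTEmpathicDialogues) to support NLP research.
Investigate latent affect signals in embeddings even when the dialogues contain no explicit mention of affect.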

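Proposed Method

Code 118 linguistic categories across 19.5K dialogues with LIWC-22 (human vs. two ChatGPT instances).
Generate 2GPTEmpathicDialogues by coordinating two ChatGPT-3.5-Turbo instances to mimic EmpathicDialogues scenarios.
Apply independent-samples t-tests with Bonferroni correction (p<.001) to compare category means, and run Levene's test to check for differences in variance.
Train and evaluate valence (affect) classifiers on OpenAI text-embedding-ada-002 embeddings with 5-fold cross-validation (random forest, SVM, MLP).
Visualize the embedding distribution by valence with UMAP and compute the Dunn Index, a measure of cluster separation.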
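
The per-category comparison described above (a t-test on the means plus Levene's test on the variances) could look roughly like the following. This is a minimal sketch, not the authors' code: the function name `compare_category`, the dictionary keys, and the synthetic scores are assumptions made for illustration, and the only value taken from the review is the corrected threshold of .001.

```python
import numpy as np
from scipy import stats


def compare_category(human, chatgpt, threshold=0.001):
    """Compare one LIWC category between two groups of per-dialogue scores.

    `threshold` is the corrected significance level reported in the review
    (p < .001). A Bonferroni correction normally divides a family-wise alpha
    by the number of tests; the exact alpha the authors started from is not
    stated here, so the threshold is passed in directly.
    """
    human = np.asarray(human, dtype=float)
    chatgpt = np.asarray(chatgpt, dtype=float)

    # Levene's test: do the two groups differ in variance?
    levene_stat, levene_p = stats.levene(human, chatgpt)

    # Independent-samples t-test: do the two groups differ in mean?
    t_stat, t_p = stats.ttest_ind(human, chatgpt)

    return {
        "t": float(t_stat),
        "t_p": float(t_p),
        "mean_differs": bool(t_p < threshold),
        "levene": float(levene_stat),
        "levene_p": float(levene_p),
        "variance_differs": bool(levene_p < threshold),
    }


# Synthetic scores for illustration only (NOT data from the paper):
# one group is more spread out, the other has a higher mean.
rng = np.random.default_rng(0)
human_scores = rng.normal(loc=50.0, scale=20.0, size=2000)
chatgpt_scores = rng.normal(loc=60.0, scale=5.0, size=2000)

result = compare_category(human_scores, chatgpt_scores)
print(result)
```

In a full analysis the same function would be called once per LIWC category, which is why a multiple-comparison correction is needed in the first place.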

Figure 1 : Framework for generation and prompts used in creating the 2GPTEmpathicDialogues dataset. In this setup, two instances of the ChatGPT-3.5-Turbo API engage in conversation with each other through a coordinating program.
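
The coordinating program in Figure 1 can be pictured as a loop that alternates turns between two independent speakers. The sketch below is an assumption-laden illustration: `run_dialogue`, `make_stub`, and the `Speaker` type are hypothetical names, the stub speakers return canned text instead of calling the ChatGPT-3.5-Turbo API, and the prompts and turn counts used in the paper are not reproduced here.

```python
from typing import Callable, List, Tuple

# A "speaker" maps the conversation so far to its next utterance.
# In the paper's setup each speaker would wrap a ChatGPT-3.5-Turbo API call;
# here it is a plain callable so the sketch runs offline.
Turn = Tuple[str, str]
Speaker = Callable[[List[Turn]], str]


def run_dialogue(speaker_a: Speaker, speaker_b: Speaker, n_turns: int) -> List[Turn]:
    """Alternate between two independent speakers for n_turns utterances."""
    history: List[Turn] = []
    speakers = [("A", speaker_a), ("B", speaker_b)]
    for turn in range(n_turns):
        name, speak = speakers[turn % 2]
        # Pass a copy so a speaker cannot mutate the shared history.
        utterance = speak(list(history))
        history.append((name, utterance))
    return history


def make_stub(label: str) -> Speaker:
    """Hypothetical stand-in for one chatbot instance; returns canned text."""
    def speak(history: List[Turn]) -> str:
        return f"{label} reply #{len(history) + 1}"
    return speak


dialogue = run_dialogue(make_stub("first"), make_stub("second"), n_turns=4)
for name, text in dialogue:
    print(f"{name}: {text}")
```

Keeping the speakers as injected callables means the loop itself stays the same whether the two sides are stubs, as here, or real API clients with their own system prompts.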

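Experimental Results

Research Questions

RQ1: Do ChatGPT-generated dialogues differ from human dialogues across LIWC categories in variability, authenticity, social behavior, cognition, and affect?
RQ2: Can embedding-based valence classification detect latent affect signals in ChatGPT and human dialogues even when affect is never mentioned explicitly?
RQ3: Does the ChatGPT-generated companion dataset (2GPTEmpathicDialogues) closely mirror the EmpathicDialogues human corpus for linguistic analysis?
RQ4: What are the implications of these linguistic differences for AI-text detection and disinformation risk?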

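Key Findings

Humans show greater variability and greater authenticity than ChatGPT across LIWC categories.
ChatGPT scores higher on social processes, prosocial behavior, politeness, communication, attentional focus, analytical thinking, cognition, and positive emotional tone.
No significant difference in overall positive or negative affect was found between ChatGPT and human dialogues.
Dialogue embeddings carry latent valence signals, and classifiers reach high F1 scores (SVM: 90.0% on human dialogues, 95.3% on ChatGPT dialogues).
UMAP shows more distinct valence clustering in the ChatGPT embeddings (Dunn Index 0.222) than in the human embeddings (0.153).
The most frequently misclassified emotions in valence classification include anxious, surprised, trusting, caring, sentimental, and hopeful, in both datasets.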
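
To make the Dunn Index figures above easier to read: the index is the smallest distance between points in different clusters divided by the largest distance between points in the same cluster, so higher values mean tighter, better-separated clusters. The sketch below implements that classic definition in plain Python. Which variant the authors used, and whether they computed it on the UMAP projection or the raw embeddings, is not stated in this review, so treat `dunn_index` and the toy 2-D points as illustrative assumptions.

```python
from itertools import combinations
from math import dist


def dunn_index(clusters):
    """Classic Dunn Index for a list of clusters of coordinate tuples.

    Assumes at least two clusters and at least one cluster with two or
    more points; otherwise the min/max below have nothing to iterate over.
    """
    # Smallest distance between any two points in different clusters.
    min_separation = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Largest distance between any two points in the same cluster.
    max_diameter = max(
        dist(p, q)
        for c in clusters
        for p, q in combinations(c, 2)
    )
    return min_separation / max_diameter


# Toy 2-D points standing in for projected embeddings (not paper data).
positive = [(0, 0), (0, 1), (1, 0)]
negative_far = [(5, 5), (5, 6), (6, 5)]
negative_near = [(1, 1), (1, 2), (2, 1)]

well_separated = dunn_index([positive, negative_far])
poorly_separated = dunn_index([positive, negative_near])
print(well_separated, poorly_separated)
```

The well-separated pair yields the larger index, mirroring the direction of the reported comparison (0.222 for ChatGPT vs. 0.153 for human dialogues).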


This review was generated by AI and reviewed by a human editor.