QUICK REVIEW

[Paper Review] The Imitation Game: Detecting Human and AI-Generated Texts in the Era of ChatGPT and BARD

Kadhim Hayawi, Sakib Shahriar | arXiv (Cornell University) | 2023-07-22

Topic Modeling | Citations: 8

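One-line summary

The paper presents a new dataset of human-written and LLM-generated texts across several genres, evaluates multiple ML models on distinguishing human from AI text, and finds that the models perform better in binary detection than in multiclass classification.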

ABSTRACT

The potential of artificial intelligence (AI)-based large language models (LLMs) holds considerable promise in revolutionizing education, research, and practice. However, distinguishing between human-written and AI-generated text has become a significant task. This paper presents a comparative study, introducing a novel dataset of human-written and LLM-generated texts in different genres: essays, stories, poetry, and Python code. We employ several machine learning models to classify the texts. Results demonstrate the efficacy of these models in discerning between human and AI-generated text, despite the dataset's limited sample size. However, the task becomes more challenging when classifying GPT-generated text, particularly in story writing. The results indicate that the models exhibit superior performance in binary classification tasks, such as distinguishing human-generated text from a specific LLM, compared to the more complex multiclass tasks that involve discerning among human-generated and multiple LLMs. Our findings provide insightful implications for AI text detection while our dataset paves the way for future research in this evolving area.

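Research motivation and objectives

Establishes the need to distinguish human-written from AI-generated text in education, research, and practice.
Introduces a new dataset of human-written and LLM-generated texts spanning multiple genres.
Evaluates several machine learning models on their ability to detect AI-generated content.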

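Proposed method

Builds the dataset from texts in four genres: essays, stories, poetry, and Python code.
Applies several machine learning classifiers to separate human-written from AI-generated text.
Analyzes classification performance, noting differences across genres and model types.
Compares binary (human vs. a single LLM) and multiclass (human plus multiple LLMs) settings.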
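
The difference between the binary and multiclass settings can be sketched as a relabeling of one shared pool of samples. The snippet below is a minimal illustration, not the authors' pipeline: the record layout, the source names ("llm_a", "llm_b"), the function names, and the toy texts are all assumptions made for this example.

```python
# Hypothetical toy records: (text, genre, source). The source names are
# placeholders for illustration, not the paper's actual label set.
SAMPLES = [
    ("An essay draft about city parks.", "essay", "human"),
    ("An essay on parks produced by a model.", "essay", "llm_a"),
    ("A short story about a lighthouse.", "story", "human"),
    ("A generated story about a lighthouse.", "story", "llm_b"),
    ("def add(a, b): return a + b", "code", "llm_a"),
]


def multiclass_task(samples):
    """Keep every sample; the label is the source itself (human or one of the LLMs)."""
    return [(text, source) for text, _genre, source in samples]


def binary_task(samples, llm):
    """Keep only human samples and samples from one LLM.

    Labels: 0 = human, 1 = the chosen LLM.
    """
    return [
        (text, 0 if source == "human" else 1)
        for text, _genre, source in samples
        if source in ("human", llm)
    ]
```

Calling binary_task(SAMPLES, "llm_a") drops the "llm_b" story and leaves a two-class problem, while multiclass_task(SAMPLES) keeps every source as its own class. Any classifier can then be trained on either output.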

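Experimental results

Research questions

RQ1: Can ML models reliably distinguish human-written from AI-generated text across genres?
RQ2: How does performance differ between separating human text from one specific LLM and separating human text from multiple LLMs?
RQ3: Is GPT-generated text harder to classify, particularly in story writing?
RQ4: What do dataset size and genre imply for AI text detection performance?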

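Key results

ML models effectively distinguish human-written from AI-generated text across genres.
Performance is robust in binary tasks (human vs. a specific LLM) but degrades in multiclass settings that include humans and multiple LLMs.
GPT-generated text is harder to classify than other cases, especially in stories.
Despite the limited sample size, the dataset supports meaningful differentiation between human and AI text and among different LLMs.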
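
A genre-level finding such as "stories are harder" is typically checked by breaking accuracy down per genre. The following is a minimal sketch under the assumption that predictions are available as (genre, true label, predicted label) tuples. The function name and the toy records are invented for illustration and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_genre(records):
    """Return {genre: accuracy} for records of (genre, true_label, predicted_label)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for genre, truth, pred in records:
        total[genre] += 1
        if truth == pred:
            correct[genre] += 1
    return {genre: correct[genre] / total[genre] for genre in total}


# Hypothetical predictions for illustration only; not data from the paper.
TOY_PREDICTIONS = [
    ("essay", "human", "human"),
    ("essay", "llm", "llm"),
    ("story", "human", "llm"),
    ("story", "llm", "llm"),
]
```

With so few samples per genre, a single misclassification moves the per-genre score substantially, which is worth keeping in mind given the limited dataset size the paper itself notes.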

This review was generated by AI and reviewed by a human editor.