QUICK REVIEW

[Paper Review] The Imitation Game: Detecting Human and AI-Generated Texts in the Era of ChatGPT and BARD

Kadhim Hayawi, Sakib Shahriar|arXiv (Cornell University)|Jul 22, 2023

Topic Modeling | Cited by 8

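Summary in Brief

This paper presents a new dataset of human-written and LLM-generated texts spanning multiple genres and evaluates several ML models on distinguishing human from AI text. Binary detection performs better than multiclass classification.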

ABSTRACT

The potential of artificial intelligence (AI)-based large language models (LLMs) holds considerable promise in revolutionizing education, research, and practice. However, distinguishing between human-written and AI-generated text has become a significant task. This paper presents a comparative study, introducing a novel dataset of human-written and LLM-generated texts in different genres: essays, stories, poetry, and Python code. We employ several machine learning models to classify the texts. Results demonstrate the efficacy of these models in discerning between human and AI-generated text, despite the dataset's limited sample size. However, the task becomes more challenging when classifying GPT-generated text, particularly in story writing. The results indicate that the models exhibit superior performance in binary classification tasks, such as distinguishing human-generated text from a specific LLM, compared to the more complex multiclass tasks that involve discerning among human-generated and multiple LLMs. Our findings provide insightful implications for AI text detection while our dataset paves the way for future research in this evolving area.

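Motivation and Objectives

Motivate the need to distinguish human-written from AI-generated text in education, research, and practice.
Introduce a new dataset containing human-written and LLM-generated texts across multiple genres.
Evaluate a range of machine learning models on their ability to detect AI-generated content.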

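Proposed Method

Build a dataset from texts in four genres: essays, stories, poetry, and Python code.
Apply several machine learning classifiers to distinguish human-written from AI-generated text.
Analyze classification performance, noting differences across genres and model types.
Compare the binary setting (human vs. one specific LLM) with the multiclass setting (human plus multiple LLMs).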
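
The difference between the binary and multiclass settings can be sketched as two ways of labeling the same samples. The snippet below is a minimal illustration, not the authors' code: the records, the source names `"gpt"` and `"bard"`, and the helper functions `multiclass_task` and `binary_task` are all hypothetical and exist only to show how the label sets differ.

```python
from collections import Counter

# Hypothetical toy records: (text, genre, source).
# These are placeholders, not samples from the paper's dataset.
SAMPLES = [
    ("An essay on climate policy ...", "essay", "human"),
    ("Once upon a time ...", "story", "human"),
    ("An essay on renewable energy ...", "essay", "gpt"),
    ("A tale of a lost robot ...", "story", "gpt"),
    ("Roses are red ...", "poetry", "bard"),
    ("def add(a, b): return a + b", "code", "bard"),
]


def multiclass_task(samples):
    """Keep every sample; the label is the source itself (human or an LLM name)."""
    return [(text, source) for text, _genre, source in samples]


def binary_task(samples, llm):
    """Keep only human samples and samples from one LLM.

    Label 0 means human, label 1 means the chosen LLM.
    """
    return [
        (text, 0 if source == "human" else 1)
        for text, _genre, source in samples
        if source in ("human", llm)
    ]


multi = multiclass_task(SAMPLES)
human_vs_gpt = binary_task(SAMPLES, "gpt")
human_vs_bard = binary_task(SAMPLES, "bard")

# Show how many samples fall under each label in each setting.
print(Counter(label for _, label in multi))
print(Counter(label for _, label in human_vs_gpt))
print(Counter(label for _, label in human_vs_bard))
```

In the binary setting a classifier only has to separate two label values, while in the multiclass setting it also has to tell the LLMs apart from one another, which is one plausible reason the latter is reported as harder.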

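Experimental Results

Research Questions

RQ1: Can ML models reliably distinguish human-written from AI-generated text across genres?
RQ2: How does detection performance change between separating human text from one specific LLM and distinguishing humans from multiple LLMs?
RQ3: Is GPT-generated text harder to classify than other cases, particularly in story writing?
RQ4: How do dataset size and genre affect AI text detection performance?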

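Key Findings

ML models effectively distinguish human-written from AI-generated text across genres.
Performance is solid on binary tasks (human vs. one specific LLM) but degrades in the multiclass setting that includes multiple LLMs.
GPT-generated text is harder to classify than other cases, particularly for stories.
Despite the limited sample size, the dataset supports meaningful distinctions between human and AI text, and among different LLMs.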
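
A per-genre breakdown like the one behind these findings could be computed along the following lines. This is a sketch under stated assumptions: the function `accuracy_by_genre` is hypothetical, accuracy is used only as an example metric (the review does not say which metrics the paper reports), and the records are made-up predictions, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_genre(records):
    """Compute accuracy per genre.

    records: iterable of (genre, true_label, predicted_label) tuples.
    Returns a dict mapping each genre to its fraction of correct predictions.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for genre, y_true, y_pred in records:
        total[genre] += 1
        correct[genre] += int(y_true == y_pred)
    return {genre: correct[genre] / total[genre] for genre in total}


# Made-up predictions for illustration only; these are NOT the paper's results.
RECORDS = [
    ("essay", "human", "human"),
    ("essay", "gpt", "gpt"),
    ("story", "human", "gpt"),
    ("story", "gpt", "gpt"),
    ("poetry", "bard", "bard"),
    ("code", "human", "human"),
]

scores = accuracy_by_genre(RECORDS)
for genre, score in scores.items():
    print(f"{genre}: {score:.2f}")
```

Grouping by genre before scoring makes it possible to see whether one genre, such as stories, is pulling overall performance down.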

This review was generated by AI and checked by a human editor.