QUICK REVIEW

[Paper Review] AI, write an essay for me: A large-scale comparison of human-written versus ChatGPT-generated essays

Steffen Herbold, Annette Hautli-Janisz|arXiv (Cornell University)|Apr 24, 2023

Artificial Intelligence in Healthcare and Education | Cited by 17

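ONE-LINE SUMMARY

This study systematically compares human-written argumentative essays with essays generated by ChatGPT and shows that the generated essays, those from GPT-4 in particular, are rated higher in overall quality than the human-written ones. Distinct linguistic patterns also appear across the models.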

ABSTRACT

Background: Recently, ChatGPT and similar generative AI models have attracted hundreds of millions of users and become part of the public discourse. Many believe that such models will disrupt society and will result in a significant change in the education system and information generation in the future. So far, this belief is based on either colloquial evidence or benchmarks from the owners of the models -- both lack scientific rigour. Objective: Through a large-scale study comparing human-written versus ChatGPT-generated argumentative student essays, we systematically assess the quality of the AI-generated content. Methods: A large corpus of essays was rated using standard criteria by a large number of human experts (teachers). We augment the analysis with a consideration of the linguistic characteristics of the generated essays. Results: Our results demonstrate that ChatGPT generates essays that are rated higher for quality than human-written essays. The writing style of the AI models exhibits linguistic characteristics that are different from those of the human-written essays, e.g., it is characterized by fewer discourse and epistemic markers, but more nominalizations and greater lexical diversity. Conclusions: Our results clearly demonstrate that models like ChatGPT outperform humans in generating argumentative essays. Since the technology is readily available for anyone to use, educators must act immediately. We must re-invent homework and develop teaching concepts that utilize these AI models in the same way as math utilized the calculator: teach the general concepts first and then use AI tools to free up time for other learning objectives.

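MOTIVATION AND OBJECTIVES

Assess the quality of AI-generated and human-written argumentative essays using a large pool of expert raters (teachers).
Characterize the linguistic differences between human-written and AI-generated essays for two ChatGPT versions (GPT-3.5 and GPT-4).
Provide a statistically rigorous analysis of essay quality, including reliability checks and correlations with linguistic features.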

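METHOD

Collect a large set of human-written student essays on 90 topics from an online forum.
For the same topics, have ChatGPT-3 (based on GPT-3.5) and ChatGPT-4 (based on GPT-4) generate essays of about 200 words using a basic zero-shot prompt.
Have 108 teachers rate the 270 essays on seven criteria using a 7-point Likert scale, yielding 658 ratings, and compute inter-rater reliability.
Run a computational linguistic analysis of lexical diversity, syntactic complexity, nominalizations, modal verbs, epistemic markers, and discourse markers.
Run Wilcoxon signed-rank tests with Holm-Bonferroni correction for multiple comparisons, and report Cohen's d as the effect size together with bootstrap-based confidence intervals.
Reproduce the analysis using the available replication package.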
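
The statistical steps in the list above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the paper's replication package: the function names, the sample ratings, and the choice of a pooled-standard-deviation Cohen's d with a percentile bootstrap are all assumptions made here. The p-values passed to `holm_bonferroni` are assumed to come from a separate test routine (for example `scipy.stats.wilcoxon`).

```python
import random
import statistics


def holm_bonferroni(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, etc.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def cohens_d(a, b):
    """Cohen's d for two samples, using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    if pooled_var == 0:
        # Degenerate case (both samples constant): no spread to scale by.
        return 0.0
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5


def bootstrap_ci(a, b, stat=cohens_d, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(a, b)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        estimates.append(stat(resample_a, resample_b))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical 7-point Likert ratings for one criterion (not data from the paper).
human_ratings = [3, 4, 4, 5, 3, 4, 5, 4]
gpt4_ratings = [5, 6, 5, 6, 6, 5, 7, 6]

effect = cohens_d(gpt4_ratings, human_ratings)
ci_low, ci_high = bootstrap_ci(gpt4_ratings, human_ratings)
adjusted = holm_bonferroni([0.01, 0.04, 0.03])  # hypothetical raw p-values
print(f"d = {effect:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
print("Holm-adjusted p-values:", adjusted)
```

The Holm step-down procedure is less conservative than a plain Bonferroni correction while still controlling the family-wise error rate, which matters when seven rating criteria are each compared across three essay sources.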

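RESULTS

Research Questions

RQ1: How good is ChatGPT, based on GPT-3.5 and on GPT-4, at writing argumentative student essays?
RQ2: How do AI-generated essays compare with human-written essays?
RQ3: Which linguistic devices are characteristic of human-written versus AI-generated content?

Key Findings

ChatGPT-generated essays are rated higher in quality than human-written essays on all criteria, and GPT-4 outperforms GPT-3.5.
GPT-4 scores higher than GPT-3.5 on logical structure, language complexity, vocabulary richness, and text linking.
Humans use more modal verbs and epistemic markers, whereas the GPT models use more nominalizations and show greater sentence complexity.
Lexical diversity increases from one model version to the next: GPT-4 is more diverse than the human writers, whereas GPT-3.5 is less diverse than them.
The differences between GPT-4 and GPT-3.5 are significant for logic, vocabulary and text linking, and complexity, which points to a broad improvement in GPT-4.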
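
To make the linguistic contrasts in these findings concrete, the sketch below counts a few of the markers on raw text. It is a rough illustration under stated assumptions, not the paper's feature extraction: the type-token ratio is a simple stand-in for a lexical diversity measure, the word lists are short hypothetical samples, the suffix rule for nominalizations is a crude heuristic, and the two example sentences are invented.

```python
import re

# Hypothetical, deliberately small marker lists for illustration only.
MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}
EPISTEMIC = {"think", "believe", "maybe", "perhaps", "probably"}
NOMINAL_SUFFIXES = ("tion", "sion", "ment", "ness", "ity")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def is_nominalization(token):
    """Crude heuristic: a longer word ending in a typical noun-forming suffix."""
    return len(token) > 6 and token.endswith(NOMINAL_SUFFIXES)


def linguistic_profile(text):
    """Return simple per-text rates for a handful of linguistic markers."""
    tokens = tokenize(text)
    n = len(tokens)
    if n == 0:
        return {"type_token_ratio": 0.0, "modals_per_100": 0.0,
                "epistemic_per_100": 0.0, "nominalizations_per_100": 0.0}
    return {
        # Distinct words divided by total words (sensitive to text length).
        "type_token_ratio": len(set(tokens)) / n,
        "modals_per_100": 100 * sum(t in MODALS for t in tokens) / n,
        "epistemic_per_100": 100 * sum(t in EPISTEMIC for t in tokens) / n,
        "nominalizations_per_100": 100 * sum(is_nominalization(t) for t in tokens) / n,
    }


# Invented example sentences, written to exaggerate the two styles.
human_like = "I think we should ban cars, because it could maybe help the city."
model_like = ("The implementation of regulation promotes the reduction of "
              "pollution and the improvement of sustainability.")

for label, text in [("human-like", human_like), ("model-like", model_like)]:
    print(label, linguistic_profile(text))
```

Because the type-token ratio falls as texts get longer, a comparison like this is only meaningful for essays of similar length, such as the roughly 200-word essays described in the method.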


This review was generated by AI and checked by a human editor.