QUICK REVIEW

[Paper Review] Is ChatGPT a Good NLG Evaluator? A Preliminary Study

Jiaan Wang, Yunlong Liang|arXiv (Cornell University)|Mar 7, 2023

Topic Modeling | Cited by 12

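One-Sentence Summary

This paper presents a preliminary assessment of ChatGPT as a general-purpose NLG evaluator, finding that it correlates highly with human judgments on several meta-evaluation datasets but is affected by prompt design and by biases in how the datasets were constructed.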

ABSTRACT

Recently, the emergence of ChatGPT has attracted wide attention from the computational linguistics community. Many prior studies have shown that ChatGPT achieves remarkable performance on various NLP tasks in terms of automatic evaluation metrics. However, the ability of ChatGPT to serve as an evaluation metric is still underexplored. Considering assessing the quality of natural language generation (NLG) models is an arduous task and NLG metrics notoriously show their poor correlation with human judgments, we wonder whether ChatGPT is a good NLG evaluation metric. In this report, we provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric. In detail, we regard ChatGPT as a human evaluator and give task-specific (e.g., summarization) and aspect-specific (e.g., relevance) instruction to prompt ChatGPT to evaluate the generated results of NLG models. We conduct experiments on five NLG meta-evaluation datasets (including summarization, story generation and data-to-text tasks). Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments in most cases. In addition, we find that the effectiveness of the ChatGPT evaluator might be influenced by the creation method of the meta-evaluation datasets. For the meta-evaluation datasets which are created greatly depending on the reference and thus are biased, the ChatGPT evaluator might lose its effectiveness. We hope our preliminary study could prompt the emergence of a general-purposed reliable NLG metric.

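Motivation and Objectives

Examine whether ChatGPT can serve as a general-purpose NLG evaluation metric.
Point out the limitations of existing automatic metrics and the potential of ChatGPT as both a reference-free and a reference-based evaluator.
Investigate how task-specific and aspect-specific prompts affect ChatGPT's judgments of NLG outputs in summarization, story generation, and data-to-text tasks.
Examine how biases in dataset construction affect the effectiveness of ChatGPT as an evaluation metric.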

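Proposed Method

Treat ChatGPT as a human evaluator and apply task-specific and aspect-specific prompts to elicit scores or ratings.
Compare ChatGPT-based evaluation with standard automatic metrics (ROUGE, BERTScore, MoverScore, PRISM, BARTScore, and others).
Guide scoring with both reference-free prompts (direct assessment (DA) and star rating) and reference-based prompts (which include the golden reference).
Evaluate on five meta-evaluation datasets spanning summarization, story generation, and data-to-text tasks.
Analyze correlation with human judgments at the sample level and the dataset level, using coefficients such as Spearman, Pearson, and Kendall.
Assess how prompt design and biases in dataset construction affect the performance of the ChatGPT evaluator.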
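The prompting step above can be illustrated with a minimal sketch. The template wording, the 0 to 100 range for DA, the function names `build_prompt` and `parse_score`, and the `Source:` / `Reference:` / `Output:` field labels are all assumptions made for illustration; they are not the prompts used in the paper. The sketch only shows how a task, an aspect, a scoring style, and an optional golden reference could be combined into one prompt, and how a numeric score might be pulled out of a free-text reply.

```python
import re


def build_prompt(task, aspect, source, output, style="da", reference=None):
    """Assemble a task- and aspect-specific scoring prompt.

    The wording is hypothetical, not the paper's template.
    style="da" asks for a direct-assessment score (assumed 0 to 100);
    style="stars" asks for a one-to-five star rating.
    """
    if style == "da":
        scale = "on a continuous scale from 0 to 100"
    elif style == "stars":
        scale = "with one to five stars"
    else:
        raise ValueError(f"unknown style: {style}")

    lines = [
        f"Score the following {task} with respect to {aspect} {scale}.",
        "",
        f"Source: {source}",
    ]
    # Reference-based variant: include the golden reference when one is given.
    if reference is not None:
        lines.append(f"Reference: {reference}")
    lines += [f"Output: {output}", "", "Score:"]
    return "\n".join(lines)


def parse_score(reply):
    """Return the first number found in the model's reply, or None."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else None
```

Passing `reference=None` yields the reference-free variant, while supplying a reference string yields the reference-based one. The prompt would then be sent to the model through whatever client is available; that call is deliberately left out here.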

Experimental Results

Research Questions

RQ1: Is ChatGPT capable of correlating with human judgments as a general NLG evaluator across multiple tasks?
RQ2: How do task-specific and aspect-specific prompts affect ChatGPT's evaluation reliability?
RQ3: Do biases in meta-evaluation datasets influence the effectiveness of ChatGPT as an NLG metric?
RQ4: How does ChatGPT compare to established automatic metrics across summarization, story generation, and data-to-text tasks?

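Key Findings

ChatGPT achieves high correlation with human judgments on several meta-evaluation datasets, most notably for story generation and summarization.
ChatGPT often outperforms existing automatic metrics in correlation with human judgments across tasks, indicating its potential as a general NLG metric.
The evaluator's effectiveness is sensitive to prompt design, so prompts need to be tailored to each task and aspect.
Datasets whose construction depends heavily on references can reduce ChatGPT's effectiveness, because the resulting lexical bias tends to favor reference-based signals.
ChatGPT also shows competitive performance on data-to-text evaluation, suggesting applicability beyond summarization and story generation.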
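The correlations with human judgments behind these findings can be computed at two levels. The sketch below uses only the Python standard library and takes one common reading of the two terms: the dataset-level score pools every output and correlates once, while the sample-level score correlates across systems within each sample and averages the results. That reading, the function names, and the toy scores are assumptions for illustration. Kendall's tau is left out for brevity.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def dataset_level(metric, human, corr=spearman):
    """Pool every (sample, system) score, then correlate once."""
    flat_metric = [s for row in metric for s in row]
    flat_human = [s for row in human for s in row]
    return corr(flat_metric, flat_human)


def sample_level(metric, human, corr=spearman):
    """Correlate across systems within each sample, then average."""
    per_sample = [corr(m, h) for m, h in zip(metric, human)]
    valid = [c for c in per_sample if c == c]  # c != c only for NaN
    return mean(valid)


# Toy scores (made up): rows are samples, columns are systems.
metric_scores = [[1, 2, 3], [11, 12, 13]]
human_scores = [[3, 4, 5], [1, 2, 3]]
print(sample_level(metric_scores, human_scores))
print(dataset_level(metric_scores, human_scores))
```

On this toy data the metric ranks the systems exactly as the humans do within each sample, yet the pooled correlation is negative because the two samples sit on different score ranges. This is one reason the two levels are worth reporting separately.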


This review was generated by AI and checked by a human editor.