QUICK REVIEW

[Paper Review] Testing the Reliability of ChatGPT for Text Annotation and Classification: A Cautionary Remark

Michael Reiss|arXiv (Cornell University)|Apr 17, 2023

Artificial Intelligence in Healthcare and Education|Cited by 31

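One-line summary

This paper analyzes the zero-shot reliability of ChatGPT for text annotation and classification, shows that outputs can be inconsistent across prompts, temperature settings, and repetitions, and recommends caution and validation rather than unsupervised use.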

ABSTRACT

Recent studies have demonstrated promising potential of ChatGPT for various text annotation and classification tasks. However, ChatGPT is non-deterministic which means that, as with human coders, identical input can lead to different outputs. Given this, it seems appropriate to test the reliability of ChatGPT. Therefore, this study investigates the consistency of ChatGPT's zero-shot capabilities for text annotation and classification, focusing on different model parameters, prompt variations, and repetitions of identical inputs. Based on the real-world classification task of differentiating website texts into news and not news, results show that consistency in ChatGPT's classification output can fall short of scientific thresholds for reliability. For example, even minor wording alterations in prompts or repeating the identical input can lead to varying outputs. Although pooling outputs from multiple repetitions can improve reliability, this study advises caution when using ChatGPT for zero-shot text annotation and underscores the need for thorough validation, such as comparison against human-annotated data. The unsupervised application of ChatGPT for text annotation and classification is not recommended.

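Motivation and objectives

Evaluate the zero-shot reliability of ChatGPT for text annotation and classification on a real-world News vs. Not News task.
Examine how a model parameter (temperature), prompt variations, and repeated inputs affect consistency.
Assess whether pooling outputs from repeated runs raises reliability to a scientifically acceptable threshold.
Highlight the implications for using ChatGPT in automated annotation software and the need for thorough validation.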

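Proposed method

Use gpt-3.5-turbo via the OpenAI API to classify 234 texts from German-language websites as News or Not News.
Create 10 different instructions (prompt variations) based on the codebook used by the human coders and on shorter alternatives.
Test two temperature settings (0.25 and 1) over 46,800 inputs (2,340 prompts [234 texts x 10 instructions] x 10 repetitions x 2 temperatures).
Measure consistency with Krippendorff's Alpha under three conditions: (i) no pooling, (ii) majority vote over 3 repetitions, and (iii) majority vote over 10 repetitions.
Compare outputs across different prompts and across repetitions of identical input to assess intra- and inter-prompt reliability.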
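
The pooling and consistency steps listed here can be illustrated with a minimal sketch. It assumes nominal labels with no missing values and uses the standard coincidence-matrix form of Krippendorff's Alpha. The function names, the tie-breaking rule, and the toy labels are hypothetical and are not taken from the paper, and the way the study paired outputs for comparison may differ from this arrangement.

```python
from collections import Counter
from itertools import permutations


def majority_vote(labels):
    """Most frequent label; ties go to the label seen first (an assumption)."""
    return Counter(labels).most_common(1)[0][0]


def krippendorff_alpha_nominal(units):
    """Krippendorff's Alpha for nominal data.

    `units` is a list of lists; each inner list holds every label
    assigned to one text (one label per coder, run, or setting).
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit with a single label cannot be paired
        # Each ordered pair of labels from different positions counts 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b)
    if expected == 0:
        raise ValueError("Alpha is undefined when only one label occurs")
    return 1.0 - (n - 1) * observed / expected


# Hypothetical toy outputs: 4 texts x 3 repetitions at two temperatures.
low_temp = [["News", "News", "News"],
            ["Not News", "Not News", "Not News"],
            ["News", "News", "Not News"],
            ["Not News", "Not News", "Not News"]]
high_temp = [["News", "Not News", "News"],
             ["Not News", "Not News", "News"],
             ["Not News", "News", "Not News"],
             ["Not News", "Not News", "Not News"]]

# Consistency across repetitions of identical input: each repetition is a "coder".
alpha_repetitions = krippendorff_alpha_nominal(low_temp)

# Consistency across temperature settings after majority-vote pooling.
pooled = [[majority_vote(lo), majority_vote(hi)]
          for lo, hi in zip(low_temp, high_temp)]
alpha_pooled = krippendorff_alpha_nominal(pooled)

print(f"alpha across repetitions (low temperature): {alpha_repetitions:.3f}")
print(f"alpha across temperatures (pooled): {alpha_pooled:.3f}")
```

The toy labels only exercise the code path and say nothing about the values reported in the study. With an even number of repetitions, such as 10, a tie is possible, so the tie rule should be fixed before annotation starts.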

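Experimental results

Research questions

RQ1: How consistent are ChatGPT's classifications of the same input across different prompts?
RQ2: How does the temperature setting affect the reliability of ChatGPT's zero-shot annotation?
RQ3: Does pooling outputs from multiple repetitions improve reliability, and to what extent?
RQ4: Is there meaningful consistency when identical input is repeated under identical settings?
RQ5: What are the implications for using ChatGPT in natural-language annotation workflows?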

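Key findings

Without pooling, consistency between the two temperature settings can fall below the reliability threshold (Alpha = 0.75).
Pooling 10 repetitions raises consistency for the same prompt across temperature settings to Alpha = 0.91.
Changing the wording of the instructions lowers consistency (Alpha does not exceed 0.6 even with pooling).
Across repetitions of identical input, the lower temperature gives higher consistency (Alpha > 0.9), whereas the higher temperature reaches an Alpha of about 0.85 at best.
Overall, zero-shot classification can be unreliable and needs validation against human-annotated data. Unsupervised use is not recommended.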
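
As a rough illustration of the recommended validation step, the sketch below compares model labels with human-annotated labels and lists the texts where they disagree. The function name and the toy labels are hypothetical, and raw agreement is used only for brevity. It does not correct for chance agreement, so a chance-corrected coefficient such as Krippendorff's Alpha would be the stricter check.

```python
from collections import Counter


def validate_against_humans(model_labels, human_labels):
    """Return (raw agreement, confusion counts, indices of disagreements)."""
    if len(model_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("at least one labeled text is required")

    pairs = list(zip(human_labels, model_labels))
    # Keys are (human label, model label) pairs.
    confusion = Counter(pairs)
    matches = sum(count for (h, m), count in confusion.items() if h == m)
    agreement = matches / len(pairs)
    # Positions where the model departs from the human annotation.
    disagreements = [i for i, (h, m) in enumerate(pairs) if h != m]
    return agreement, confusion, disagreements


# Hypothetical toy data: five texts with human and model labels.
human = ["News", "Not News", "News", "Not News", "Not News"]
model = ["News", "Not News", "Not News", "Not News", "News"]

agreement, confusion, disagreements = validate_against_humans(model, human)
print(f"raw agreement: {agreement:.2f}")
print("texts to re-check by hand:", disagreements)
```

Listing the disagreeing texts, rather than reporting a single score, keeps a human in the loop for exactly the cases where the model and the annotators diverge.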


This review was generated by AI and checked by a human editor.