QUICK REVIEW

[Paper Review] ChatGPT as Research Scientist: Probing GPT's Capabilities as a Research Librarian, Research Ethicist, Data Generator and Data Predictor

Steven A. Lehr, Aylin Caliskan|arXiv (Cornell University)|Jun 20, 2024

Artificial Intelligence in Healthcare and Education|Cited by 5

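One-sentence summary

This paper evaluates GPT-3.5 and GPT-4 in four scientific roles (librarian, ethicist, data generator, and data predictor). It reports improvement in some areas, such as fewer hallucinated references and better detection of ethical violations, while finding that neither model predicts novel data well.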

ABSTRACT

How good a research scientist is ChatGPT? We systematically probed the capabilities of GPT-3.5 and GPT-4 across four central components of the scientific process: as a Research Librarian, Research Ethicist, Data Generator, and Novel Data Predictor, using psychological science as a testing field. In Study 1 (Research Librarian), unlike human researchers, GPT-3.5 and GPT-4 hallucinated, authoritatively generating fictional references 36.0% and 5.4% of the time, respectively, although GPT-4 exhibited an evolving capacity to acknowledge its fictions. In Study 2 (Research Ethicist), GPT-4 (though not GPT-3.5) proved capable of detecting violations like p-hacking in fictional research protocols, correcting 88.6% of blatantly presented issues, and 72.6% of subtly presented issues. In Study 3 (Data Generator), both models consistently replicated patterns of cultural bias previously discovered in large language corpora, indicating that ChatGPT can simulate known results, an antecedent to usefulness for both data generation and skills like hypothesis generation. Contrastingly, in Study 4 (Novel Data Predictor), neither model was successful at predicting new results absent in their training data, and neither appeared to leverage substantially new information when predicting more versus less novel outcomes. Together, these results suggest that GPT is a flawed but rapidly improving librarian, a decent research ethicist already, capable of data generation in simple domains with known characteristics but poor at predicting novel patterns of empirical data to aid future experimentation.

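Research motivation and objectives

Evaluate GPT-3.5 and GPT-4 as research librarians by examining the quality of their reference lists and their rate of hallucinated (fabricated) references.
Evaluate them as research ethicists by measuring how well they detect and correct flawed research practices.
Evaluate them as data generators by testing whether they reproduce known biases and simulate established results.
Evaluate them as novel data predictors by testing their predictions against real-world data patterns they have not seen.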

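Methods

Study 1 (Librarian): Generate 1,000 references (20 for each of 25 psychology topics, from each of the two models) and evaluate them for accuracy, completeness, relevance, and citation count.
Study 2 (Ethicist): Present 18 vignettes describing flawed research protocols, some blatant and some subtle, and rate GPT's responses for ethical and reflective quality across 216 interactions.
Study 3 (Data Generator): Have the models estimate associations of the kind measured with word embeddings, and check with WEAT-style evaluations in four domains whether they reproduce known bias patterns.
Study 4 (Novel Data Predictor): Have the models predict country-level attitudes from Project Implicit data, distinguishing implicit from explicit attitudes, to assess novelty and predictive validity.
Quantitative analyses include logistic regression, Cronbach's α reliability, and correlation with real-world data.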
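
To make the WEAT-style evaluation in Study 3 more concrete, here is a minimal sketch of the standard WEAT effect size, computed from pairwise association ratings rather than from embedding vectors. The function names, the toy words, and the ratings are all hypothetical and are not taken from the paper; the sketch also assumes the sample standard deviation in the denominator, which is one common convention and may differ from the authors' analysis.

```python
from statistics import mean, stdev


def association(word, attrs_a, attrs_b, sim):
    """s(w, A, B): mean similarity of `word` to set A minus mean similarity to set B."""
    return mean(sim(word, a) for a in attrs_a) - mean(sim(word, b) for b in attrs_b)


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b, sim):
    """Difference in mean association between the two target sets,
    divided by the standard deviation over all target words."""
    s_x = [association(x, attrs_a, attrs_b, sim) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b, sim) for y in targets_y]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


# Hypothetical similarity ratings on a 0-1 scale, standing in for values
# a model might be asked to estimate. These are NOT data from the paper.
ratings = {
    ("rose", "pleasant"): 0.9, ("rose", "unpleasant"): 0.1,
    ("tulip", "pleasant"): 0.8, ("tulip", "unpleasant"): 0.2,
    ("wasp", "pleasant"): 0.2, ("wasp", "unpleasant"): 0.8,
    ("moth", "pleasant"): 0.3, ("moth", "unpleasant"): 0.6,
}


def sim(word, attribute):
    """Look up the toy rating for a (target word, attribute word) pair."""
    return ratings[(word, attribute)]


flowers, insects = ["rose", "tulip"], ["wasp", "moth"]
d = weat_effect_size(flowers, insects, ["pleasant"], ["unpleasant"], sim)
d_swapped = weat_effect_size(insects, flowers, ["pleasant"], ["unpleasant"], sim)
print(f"effect size: {d:.2f}, with target sets swapped: {d_swapped:.2f}")
```

A positive value means the first target set is more strongly associated with the first attribute set than the second target set is; swapping the target sets flips the sign.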

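Results

Research questions

RQ1: Can GPT reliably produce comprehensive and accurate bibliographies without hallucinating?
RQ2: How well can GPT detect and address ethical problems and practices such as p-hacking in research protocols?
RQ3: How well can GPT reproduce known data patterns (biases and stereotypes) and generate plausible data?
RQ4: Can GPT predict novel empirical patterns outside its training data, and how does performance differ between GPT-3.5 and GPT-4?
RQ5: What are the limits of GPT's usefulness as a general scientific assistant, and what are its future prospects?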

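Key findings

GPT-3.5 hallucinated 36.0% of its references and GPT-4 hallucinated 5.4%. GPT-4 was also more candid about its fictions: it acknowledged 84.3% of its fictional references as fictional, versus 12.2% for GPT-3.5.
On the ethics vignettes GPT-4 outperformed GPT-3.5, scoring 8.86/10 on blatant cases and 7.26/10 on subtle cases, against 5.39/10 and 4.05/10 for GPT-3.5.
In data generation, GPT consistently reproduced known bias patterns (e.g., WEAT-like results), showing that it can simulate established results; this supports its use for generating pilot data and hypotheses.
In the novel data prediction task, GPT-3.5 and GPT-4 showed limited ability to predict highly novel data: correlations with real-world results varied and were lower for novel implicit attitudes.
Prompting themed on data ethics improved response quality: prompts that framed the task in terms of ethics produced higher-quality output than prompts with no ethical element.
Overall, GPT is a flawed but improving librarian and a reasonable ethics tool; it can generate data in simple domains but is poor at predicting novel empirical patterns.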
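
The novel data prediction finding rests on correlating model predictions with observed real-world values. The following is a small sketch of that kind of check using a hand-written Pearson correlation. All numbers are made-up toy values chosen only to illustrate a high versus a near-zero correlation; they are not Project Implicit data or results from the paper, and the variable names are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy country-level scores (hypothetical, for illustration only).
observed = [0.30, 0.35, 0.40, 0.45, 0.50]
predicted_familiar = [0.28, 0.36, 0.41, 0.44, 0.52]  # tracks the observed values
predicted_novel = [0.42, 0.31, 0.47, 0.33, 0.40]     # barely tracks them

r_familiar = pearson_r(observed, predicted_familiar)
r_novel = pearson_r(observed, predicted_novel)
print(f"familiar pattern r = {r_familiar:.2f}, novel pattern r = {r_novel:.2f}")
```

In a real analysis each list would hold one value per country, and the comparison of interest would be how the correlation changes between more and less novel outcomes.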

This review was generated by AI and checked by a human editor.