QUICK REVIEW

[Paper Review] Diminished Diversity-of-Thought in a Standard Large Language Model

Peter S. Park, Philipp Schoenegger|arXiv (Cornell University)|Feb 13, 2023

Computational and Text Analysis Methods | Cited by 10

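One-Sentence Summary

This paper tests GPT-3.5 as a stand-in for human participants in social-science replication studies and documents a phenomenon in which certain prompts drive response variation to zero or near zero, calling into question the validity of LLMs as a general replacement for human participants.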

ABSTRACT

We test whether Large Language Models (LLMs) can be used to simulate human participants in social-science studies. To do this, we run replications of 14 studies from the Many Labs 2 replication project with OpenAI's text-davinci-003 model, colloquially known as GPT3.5. Based on our pre-registered analyses, we find that among the eight studies we could analyse, our GPT sample replicated 37.5% of the original results and 37.5% of the Many Labs 2 results. However, we were unable to analyse the remaining six studies due to an unexpected phenomenon we call the "correct answer" effect. Different runs of GPT3.5 answered nuanced questions probing political orientation, economic preference, judgement, and moral philosophy with zero or near-zero variation in responses: with the supposedly "correct answer." In one exploratory follow-up study, we found that a "correct answer" was robust to changing the demographic details that precede the prompt. In another, we found that most but not all "correct answers" were robust to changing the order of answer choices. One of our most striking findings occurred in our replication of the Moral Foundations Theory survey results, where we found GPT3.5 identifying as a political conservative in 99.6% of the cases, and as a liberal in 99.3% of the cases in the reverse-order condition. However, both self-reported 'GPT conservatives' and 'GPT liberals' showed right-leaning moral foundations. Our results cast doubts on the validity of using LLMs as a general replacement for human participants in the social sciences. Our results also raise concerns that a hypothetical AI-led future may be subject to a diminished diversity-of-thought.

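Research Motivation and Objectives

Evaluate whether a standard LLM (GPT-3.5) can simulate human participants in social-science replication studies.
Measure replication success across multiple studies against the Many Labs 2 project.
Identify and characterise phenomena that diminish diversity of thought in LLM responses.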

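Proposed Method

Replicate 14 studies from Many Labs 2 using OpenAI's GPT-3.5 (text-davinci-003).
Run pre-registered analyses comparing the GPT outputs with the original results and the Many Labs 2 results.
Analyse the eight studies that could be analysed, report replication rates, and document the unexpected zero or near-zero variation in responses.
Conduct exploratory follow-up studies that vary the demographic details preceding the prompt and the order of answer choices, to test how robust the responses are.
Examine the Moral Foundations Theory survey results under different prompt conditions.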
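The near-zero variation described above can be quantified in several ways. The following is a minimal sketch, not the authors' analysis code: the function names, the 0.95 threshold, and the sample data are all illustrative assumptions. It computes the share of runs that gave the most common answer and the Shannon entropy of the answer distribution.

```python
from collections import Counter
import math


def modal_share(responses):
    """Fraction of runs that gave the single most common answer."""
    counts = Counter(responses)
    return counts.most_common(1)[0][1] / len(responses)


def shannon_entropy(responses):
    """Entropy in bits of the empirical answer distribution (0 means no variation)."""
    n = len(responses)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(responses).values())


def is_near_zero_variation(responses, threshold=0.95):
    """Flag a question whose answers are dominated by one option.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return modal_share(responses) >= threshold


# Made-up responses from 100 hypothetical runs of one survey question.
runs = ["agree"] * 98 + ["disagree"] * 2
print(modal_share(runs))
print(shannon_entropy(runs))
print(is_near_zero_variation(runs))
```

A question flagged this way leaves too little variance for the usual between-condition comparisons, which is consistent with the review's note that some studies could not be analysed.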

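Experimental Results

Research Questions

RQ1: Can GPT-3.5 replicate a substantial share of the original Many Labs 2 results?
RQ2: Do GPT-3.5's responses show enough diversity to serve as a valid replacement for human participants?
RQ3: How do phenomena that limit the use of LLMs in social-science replication studies (e.g. the "correct answer" effect) manifest?
RQ4: How robust are conclusions drawn from GPT-3.5 to changes in the demographic details and answer order in the prompt?
RQ5: What are the implications for diversity of thought in AI-led future scenarios?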

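Key Findings

Among the eight studies that could be analysed, GPT-3.5 replicated 37.5% of the original results.
GPT-3.5 also replicated 37.5% of the Many Labs 2 results.
An unexpected "correct answer" effect produced zero or near-zero variation in responses to nuanced questions, leaving the remaining six studies unanalysable.
In an exploratory follow-up study, a "correct answer" was robust to changes in the demographic details preceding the prompt.
In the Moral Foundations Theory replication, GPT-3.5 identified as a political conservative in 99.6% of cases, and as a liberal in 99.3% of cases in the reverse-order condition, yet both groups showed right-leaning moral foundations.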
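As a quick arithmetic check on the figures above, the sketch below converts the reported percentages into study counts. The variable names are illustrative; the inputs (14 studies, 8 analysable, 37.5%) come from the abstract.

```python
from fractions import Fraction

total_studies = 14
analysable = 8
rate = Fraction(375, 1000)  # 37.5%, as reported in the abstract

unanalysable = total_studies - analysable
replicated = rate * analysable

print(unanalysable)  # studies excluded because of the "correct answer" effect
print(replicated)    # studies whose result was replicated
```

Both replication rates therefore correspond to 3 of the 8 analysable studies, with 6 of the 14 excluded.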

This review was generated by AI and checked by a human editor.