QUICK REVIEW

[Paper Review] Same Prompt, Different Outcomes: Evaluating the Reproducibility of Data Analysis by LLMs

Jiaxin Cui, Rohan Alexander|arXiv (Cornell University)|Feb 15, 2026

Topic: Topic Modeling | Citations: 0

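One-Sentence Summary

This paper systematically evaluates the reproducibility of LLM-generated data analysis across models, prompting strategies, and temperatures, shows that results can vary widely even under identical settings, and recommends running multiple independent executions.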

ABSTRACT

We systematically evaluate the reproducibility of data analysis conducted by Large Language Models (LLMs). We evaluate two prompting strategies, six models, and four temperature settings, with ten independent executions per configuration, yielding 480 total attempts. We assess the completion, concordance, validity, and consistency of each attempt and find considerable variation in the analytical results even for consistent configurations. This suggests, as with human data analysis, the data analysis conducted by LLMs can vary, even given the same task, data, and settings. Our results mean that if an LLM is being used to conduct data analysis, then it should be run multiple times independently and the distribution of results considered.

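Motivation and Objectives

Motivate the need to study the reproducibility of LLM-generated data analysis and its implications for scientific findings.
Evaluate how prompting strategy (single-step vs. multi-step), model, and temperature affect reproducibility.
Quantify completion, concordance, validity, and consistency for each configuration using a five-step data analysis pipeline.
Provide guidance on running LLM-based analyses multiple times and examining the distribution of results.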

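Method

Evaluates six models from three providers (Anthropic, OpenAI, Google) under two prompting strategies (single-step and multi-step) and four temperatures (0.0, 0.3, 0.7, 1.0); GPT-5-mini is run at its default of 1.0.
Applies a five-step data analysis pipeline to New Brunswick appointments data, with ten independent executions per configuration, for 480 attempts in total.
Assesses each attempt on four metrics (completion, concordance, validity, consistency), examining outputs such as whether the code executes, agreement with the human analysis, data types, and regression results.
The five pipeline steps are: combining the CSV files, identifying reappointments, aggregating to organization-year summaries, fitting an OLS regression, and generating a visualization.
Documents that multi-step prompting is prone to error propagation, whereas single-step prompts achieve higher completion and more consistent pipelines in many settings.
The analysis is carried out in R, using tidyverse and tinytable to produce the analysis, tables, and figures.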
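The experimental design described above can be sketched as a simple enumeration. This is only an illustration of the arithmetic in the abstract (2 strategies × 6 models × 4 temperatures × 10 executions = 480). The model labels are placeholders, since this review names the providers but not the individual models, and the sketch assumes a full factorial grid without modeling the GPT-5-mini temperature caveat.

```python
from itertools import product

# Placeholder labels (assumption): the review does not list the six model names.
STRATEGIES = ["single_step", "multi_step"]
MODELS = [f"model_{i}" for i in range(1, 7)]
TEMPERATURES = [0.0, 0.3, 0.7, 1.0]
RUNS_PER_CONFIG = 10


def build_attempts():
    """Enumerate every (strategy, model, temperature, run) attempt."""
    return [
        {"strategy": s, "model": m, "temperature": t, "run": r}
        for s, m, t in product(STRATEGIES, MODELS, TEMPERATURES)
        for r in range(1, RUNS_PER_CONFIG + 1)
    ]


if __name__ == "__main__":
    attempts = build_attempts()
    print(len(attempts))  # 2 * 6 * 4 * 10 = 480
```

Keeping the run index as an explicit field makes it straightforward to group results by configuration later and inspect the spread across the ten executions.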

Figure 1 : Evaluation metrics across pipeline steps, models, temperatures, and prompting strategies. Each tile shows the rate for one model-step combination. Rows are grouped by prompting strategy, columns by temperature. Color intensity indicates the metric value from 0 (red) to 1 (green). GPT-5-mi

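Results

Research Questions

RQ1: How reproducible is LLM-generated data analysis when the same task, data, and settings are run multiple times?
RQ2: Because of error propagation, do single-step prompts produce more reliable output than multi-step prompts, and how do model and temperature affect this?
RQ3: How closely do LLM-generated analyses agree with the human analysis (concordance), and do they meet the validity criteria at each pipeline step?
RQ4: How do data preprocessing decisions within the LLM-generated pipeline affect downstream estimates such as the regression slope and t-statistic?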
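To make RQ4 concrete, the following minimal sketch shows how one preprocessing decision can move a downstream OLS estimate. Everything here is an assumption for illustration: the toy records are invented, the two missing-value policies are hypothetical examples of the kind of choice involved, and the code is not the paper's pipeline (which is written in R).

```python
import math
from collections import defaultdict

# Invented toy records: (organization, year, reappointed flag or None if missing).
RECORDS = [
    ("A", 2020, True), ("A", 2020, False), ("A", 2021, True), ("A", 2021, None),
    ("A", 2022, True), ("A", 2022, None), ("B", 2020, False), ("B", 2020, False),
    ("B", 2021, True), ("B", 2021, False), ("B", 2022, True), ("B", 2022, None),
]


def org_year_rates(records, missing_as_false):
    """Aggregate to organization-year reappointment rates under one policy."""
    counts = defaultdict(lambda: [0, 0])  # (org, year) -> [reappointed, total]
    for org, year, flag in records:
        if flag is None:
            if not missing_as_false:
                continue  # policy 1: drop records with a missing flag
            flag = False  # policy 2: treat a missing flag as "not reappointed"
        counts[(org, year)][0] += int(flag)
        counts[(org, year)][1] += 1
    return {key: r / n for key, (r, n) in counts.items() if n > 0}


def ols_slope_t(xs, ys):
    """Simple linear regression of ys on xs; returns (slope, t-statistic)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    return slope, (slope / se if se > 0 else float("inf"))


def fit_trend(rates):
    """Regress the organization-year rate on year."""
    years = [year for (_, year) in rates]
    return ols_slope_t(years, list(rates.values()))


if __name__ == "__main__":
    for policy in (False, True):
        slope, t_stat = fit_trend(org_year_rates(RECORDS, missing_as_false=policy))
        print(f"missing_as_false={policy}: slope={slope:.3f}, t={t_stat:.2f}")
```

On this toy data the slope is 0.375 when missing flags are dropped and 0.125 when they are counted as not reappointed, so the same records and the same regression yield different estimates purely because of the preprocessing choice.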

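Key Findings

Single-step prompting generally achieves higher completion rates than multi-step prompting because it suffers less error propagation.
The generated code is structurally valid, but its data preprocessing choices (sorting, missing values, inclusion criteria) differ from the human analysis and produce downstream variation.
Estimates form clusters across executions; the slope and t-statistic can vary in sign and magnitude, and in many settings the t-statistic is not significant.
In most settings the t-statistic is near zero, although some single-step settings can produce potentially significant results. Across many executions, however, the variability undermines any single definitive conclusion.
Consistency across executions is imperfect: outputs can differ even under identical settings, which underscores the need for multiple independent executions.
The study recommends evaluating the distribution of results and, where possible, accounting for variability with ensembles or comparisons across multiple providers.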
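One way to act on the recommendation to consider the distribution of results is to summarize the estimates from repeated executions rather than report a single run. The sketch below is an assumption-laden illustration: the `summarize_runs` helper, the example numbers, and the |t| > 2 rule of thumb are all hypothetical and are not taken from the paper.

```python
import statistics


def summarize_runs(results, t_threshold=2.0):
    """Summarize (slope, t_stat) pairs, one pair per independent execution.

    t_threshold is a rough rule-of-thumb cutoff (assumption), not a value
    reported in the paper.
    """
    slopes = [slope for slope, _ in results]
    n = len(results)
    return {
        "n_runs": n,
        "median_slope": statistics.median(slopes),
        "min_slope": min(slopes),
        "max_slope": max(slopes),
        "share_positive": sum(s > 0 for s in slopes) / n,
        "share_abs_t_above_threshold": sum(abs(t) > t_threshold for _, t in results) / n,
    }


if __name__ == "__main__":
    # Invented example values standing in for five executions of one configuration.
    runs = [(0.02, 0.4), (-0.01, -0.2), (0.15, 2.6), (0.03, 0.7), (-0.04, -0.9)]
    for key, value in summarize_runs(runs).items():
        print(f"{key}: {value}")
```

Reporting the range, the share of positive slopes, and the share of runs crossing the threshold makes it visible when a single "significant" run is an outlier among otherwise near-zero estimates.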

Figure 2 : Comparison of LLM-estimated reappointment rates to those from human analysis at the department-year level. Each point is one department-year observation from one execution. The dashed 45-degree line indicates the estimates are the same. GPT-5-mini is only evaluated at its default temperat


This review was generated by AI and checked by a human editor.