QUICK REVIEW

[Paper Review] The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas

Chenglei Si, Tatsunori Hashimoto | arXiv.org | Jun 25, 2025

Artificial Intelligence in Healthcare and Education | Citations: 3

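One-line summary

The paper reports a large-scale execution study comparing LLM-generated research ideas with human-written ones. AI-generated ideas lose more of their score once executed, so the advantage they showed at the ideation stage shrinks or reverses.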

ABSTRACT

Large Language Models (LLMs) have shown promise in accelerating the scientific research pipeline. A key capability for this process is the ability to generate novel research ideas, and prior studies have found settings in which LLM-generated research ideas were judged as more novel than human-expert ideas. However, a good idea should not simply appear to be novel, it should also result in better research after being executed. To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study by recruiting 43 expert researchers to execute randomly-assigned ideas, either written by experts or generated by an LLM. Each expert spent over 100 hours implementing the idea and wrote a 4-page short paper to document the experiments. All the executed projects are then reviewed blindly by expert NLP researchers. Comparing the review scores of the same ideas before and after execution, the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p < 0.05), closing the gap between LLM and human ideas observed at the ideation stage. When comparing the aggregated review scores from the execution study, we even observe that for many metrics there is a flip in rankings where human ideas score higher than LLM ideas. This ideation-execution gap highlights the limitations of current LLMs in generating truly effective research ideas and the challenge of evaluating research ideas in the absence of execution outcomes.

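Motivation and objectives

Evaluate whether LLM-generated research ideas lead to better execution outcomes than human ideas.
Measure how ideation-stage quality relates to execution outcomes under realistic constraints.
Identify the factors behind the ideation-execution gap for AI-generated ideas.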

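Method

Recruit 43 expert NLP researchers and have each execute a randomly assigned idea from either a human or an AI (Claude-3.5-Sonnet) source.
Participants spend about 103 hours on average executing the idea and write a 4-page paper documenting the experiments.
The idea source is blinded and randomized. Execution follows standardized instructions within a 3-month window.
Expert reviewers (n=58) blindly evaluate the executed projects with a rubric covering novelty, excitement, soundness, effectiveness, overall score, and faithfulness.
Ideation scores are taken from the prior study; the execution scores are compared against these pre-execution scores to measure the ideation-execution gap.
The study design is preregistered and the data are publicly released.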
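The gap measurement described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the record layout, the function names `score_change` and `mean_change`, and all numbers are hypothetical and chosen only to show the arithmetic (score after execution minus score at ideation, averaged per idea source).

```python
from statistics import mean

# Hypothetical toy records, NOT data from the paper: each idea has a source
# condition, a mean review score at ideation, and one after execution.
ideas = [
    {"source": "ai", "ideation": 6.0, "execution": 4.0},
    {"source": "ai", "ideation": 5.5, "execution": 4.5},
    {"source": "human", "ideation": 5.0, "execution": 4.5},
    {"source": "human", "ideation": 4.5, "execution": 4.5},
]


def score_change(idea):
    """Execution score minus ideation score for one idea (negative = drop)."""
    return idea["execution"] - idea["ideation"]


def mean_change(ideas, source):
    """Average score change over all ideas from one source condition."""
    return mean(score_change(i) for i in ideas if i["source"] == source)


ai_change = mean_change(ideas, "ai")        # -1.5 on the toy data
human_change = mean_change(ideas, "human")  # -0.25 on the toy data

# A negative difference means AI ideas dropped more than human ideas.
gap_difference = ai_change - human_change
print(ai_change, human_change, gap_difference)
```

In the toy data the AI ideas start higher but fall further, which is the pattern the review describes; a real analysis would repeat this per metric and attach a significance test to the difference.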

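Results

Research questions

RQ1: Do AI-generated ideas lead to better execution outcomes than human-generated ideas?
RQ2: How do ideation-stage and execution-stage evaluations compare for AI and human ideas?
RQ3: How large is the ideation-execution gap for AI ideas compared with human ideas?
RQ4: Which factors that rarely surface in ideation-stage evaluation do reviewers weigh more heavily in execution-stage evaluation?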

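Key findings

AI ideas score higher than human ideas in the pre-execution (ideation) evaluation, but after execution their scores drop more on novelty, excitement, effectiveness, and overall (p<0.05).
Across the execution evaluation, human ideas tend to retain their scores better than AI ideas; after execution the gap narrows on several metrics, and on some the ranking flips.
Comparing the ideation-execution gaps directly, AI ideas tend to drop substantially more than human ideas on novelty, excitement, effectiveness, and overall, with statistically significant differences (FDR-corrected p-values).
Some AI ideas score below human ideas on particular metrics after execution, but given the sample size these differences do not always reach statistical significance.
Execution-stage reviewers emphasize empirical performance and experimental rigor, and often identify weaknesses that were not visible at the ideation stage.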
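The findings mention FDR-corrected p-values without naming the procedure. As an illustration only, the sketch below implements the Benjamini-Hochberg adjustment, a common choice for controlling the false discovery rate; whether the paper uses this exact procedure is an assumption, and the raw p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of its own p * m / rank and those of all larger ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four review metrics (not from the paper).
raw = {"novelty": 0.01, "excitement": 0.02, "effectiveness": 0.03, "overall": 0.04}
adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))

for metric, p_adj in adjusted.items():
    print(f"{metric}: adjusted p = {p_adj:.3f}")
```

A metric would then be reported as significant when its adjusted p-value falls below the chosen threshold, for example 0.05.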


This review was generated by AI and checked by a human editor.