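[Paper Explained] The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas
This paper conducts a large-scale execution study comparing LLM-generated ideas with human ideas. It finds that AI-generated ideas lose more quality after execution, which weakens or reverses the advantage they held at the ideation stage.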
Large Language Models (LLMs) have shown promise in accelerating the scientific research pipeline. A key capability for this process is the ability to generate novel research ideas, and prior studies have found settings in which LLM-generated research ideas were judged as more novel than human-expert ideas. However, a good idea should not simply appear to be novel, it should also result in better research after being executed. To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study by recruiting 43 expert researchers to execute randomly-assigned ideas, either written by experts or generated by an LLM. Each expert spent over 100 hours implementing the idea and wrote a 4-page short paper to document the experiments. All the executed projects are then reviewed blindly by expert NLP researchers. Comparing the review scores of the same ideas before and after execution, the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p < 0.05), closing the gap between LLM and human ideas observed at the ideation stage. When comparing the aggregated review scores from the execution study, we even observe that for many metrics there is a flip in rankings where human ideas score higher than LLM ideas. This ideation-execution gap highlights the limitations of current LLMs in generating truly effective research ideas and the challenge of evaluating research ideas in the absence of execution outcomes.
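Research Motivation and Goals
- Assess whether LLM-generated research ideas lead to better execution outcomes than human ideas.
- Measure how ideation-stage quality relates to execution outcomes under realistic constraints.
- Identify the factors behind the ideation-execution gap in AI-driven idea generation.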
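Proposed Method
- Recruit 43 expert NLP researchers to execute randomly assigned ideas written either by humans or by an AI (Claude-3.5-Sonnet).
- Participants spent about 103 hours on average executing their idea and wrote a 4-page paper documenting the experiments.
- The source of each idea was blinded and randomized; execution followed standardized instructions within a three-month window.
- Expert reviewers (n=58) blindly reviewed the executed projects using a rubric covering novelty, excitement, soundness, effectiveness, overall score, and faithfulness.
- Ideation scores come from a prior study; execution scores are compared against these pre-execution scores to measure the ideation-execution gap.
- The study design was pre-registered and the data are publicly released.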
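The before/after comparison described above can be sketched in a few lines. This is a minimal illustration rather than the authors' analysis code: the scores are made up, the helper names `gap` and `permutation_test` are hypothetical, and the permutation test is swapped in as one simple way to compare two groups, since this summary does not state which test the paper used.

```python
import random
from statistics import mean


def gap(ideation, execution):
    """Per-idea score change: execution score minus ideation score."""
    return [e - i for i, e in zip(ideation, execution)]


def permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Returns the observed difference mean(a) - mean(b) and an estimated p-value.
    """
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        # Small tolerance so ties with the observed value count as hits.
        if abs(diff) >= abs(observed) - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Made-up review scores on a 1-10 scale, one entry per idea (NOT data from the paper).
ai_ideation = [6.0, 5.5, 6.5, 6.0, 5.0, 6.5]
ai_execution = [4.0, 4.5, 4.0, 3.5, 4.0, 4.5]
human_ideation = [5.0, 4.5, 5.5, 5.0, 4.0, 5.5]
human_execution = [4.5, 4.5, 5.0, 4.5, 4.0, 5.0]

ai_gap = gap(ai_ideation, ai_execution)
human_gap = gap(human_ideation, human_execution)

# A negative difference means AI ideas dropped more than human ideas.
diff, p_value = permutation_test(ai_gap, human_gap)
print(f"mean AI gap    = {mean(ai_gap):.2f}")
print(f"mean human gap = {mean(human_gap):.2f}")
print(f"difference     = {diff:.2f}, p = {p_value:.4f}")
```

Comparing per-idea changes rather than raw execution scores is what isolates the gap: each idea serves as its own baseline, so a condition that started higher cannot hide a larger drop.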
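Experimental Results
Research Questions
- RQ1: Do AI-generated ideas lead to better execution outcomes than human-generated ideas?
- RQ2: How does the comparison between AI and human ideas differ between ideation-stage evaluation and execution-stage evaluation?
- RQ3: How large is the ideation-execution gap for AI ideas relative to human ideas?
- RQ4: Which factors do reviewers weigh in execution evaluation that are not reflected in ideation evaluation?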
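Key Findings
- AI ideas score higher than human ideas in the ideation (pre-execution) evaluation, but after execution their scores drop more on novelty, excitement, effectiveness, and overall (p<0.05).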
- In the execution reviews, human ideas hold their scores better, so AI ideas lose their pre-execution lead on several metrics.
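- When the ideation-to-execution gap is compared across conditions, AI ideas show a larger drop than human ideas (for example on novelty, excitement, effectiveness, and overall), and the difference is statistically significant (p-values are FDR-corrected).
- On some metrics, AI ideas even rank below human ideas after execution, although given the sample size this difference is not always statistically significant.
- In the execution evaluation, reviewers take empirical performance and experimental rigor into account, and often identify weaknesses that were not apparent at the ideation stage.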
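Because the gap is tested on several metrics at once, the p-values are FDR-corrected. The sketch below shows the Benjamini-Hochberg procedure, a common FDR method; this summary does not say which FDR procedure the paper used, so treat the choice as an assumption. The function name `benjamini_hochberg` and the raw p-values are hypothetical and are not results from the paper.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values, one per metric (NOT results from the paper).
metrics = ["novelty", "excitement", "effectiveness", "overall"]
raw = [0.001, 0.02, 0.03, 0.2]

adjusted = benjamini_hochberg(raw)
for name, p_raw, p_adj in zip(metrics, raw, adjusted):
    flag = "significant" if p_adj < 0.05 else "not significant"
    print(f"{name:14s} raw={p_raw:.3f} adjusted={p_adj:.3f} ({flag})")
```

The correction matters here because four tests at a nominal 0.05 threshold would otherwise raise the chance of at least one false positive well above 5%.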
This explainer was generated by AI and reviewed by a human editor.