QUICK REVIEW

[Paper Review] The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas

Chenglei Si, Tatsunori Hashimoto | ArXiv.org | 2025-06-25

Topic: Artificial Intelligence in Healthcare and Education | Citations: 3

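ONE-LINE SUMMARY

This paper runs a large-scale execution study comparing LLM-generated and human-written research ideas, and concludes that AI-generated ideas lose more quality once executed, which shrinks or even reverses their ideation-stage advantage.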

ABSTRACT

Large Language Models (LLMs) have shown promise in accelerating the scientific research pipeline. A key capability for this process is the ability to generate novel research ideas, and prior studies have found settings in which LLM-generated research ideas were judged as more novel than human-expert ideas. However, a good idea should not simply appear to be novel, it should also result in better research after being executed. To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study by recruiting 43 expert researchers to execute randomly-assigned ideas, either written by experts or generated by an LLM. Each expert spent over 100 hours implementing the idea and wrote a 4-page short paper to document the experiments. All the executed projects are then reviewed blindly by expert NLP researchers. Comparing the review scores of the same ideas before and after execution, the scores of the LLM-generated ideas decrease significantly more than expert-written ideas on all evaluation metrics (novelty, excitement, effectiveness, and overall; p < 0.05), closing the gap between LLM and human ideas observed at the ideation stage. When comparing the aggregated review scores from the execution study, we even observe that for many metrics there is a flip in rankings where human ideas score higher than LLM ideas. This ideation-execution gap highlights the limitations of current LLMs in generating truly effective research ideas and the challenge of evaluating research ideas in the absence of execution outcomes.

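MOTIVATION AND GOALS

Evaluate whether LLM-generated research ideas lead to better execution outcomes than human ideas.
Measure how ideation-stage quality relates to execution outcomes under realistic constraints.
Identify the factors that contribute to the ideation-execution gap in AI-driven idea generation.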

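PROPOSED METHOD

The study recruits 43 expert NLP researchers to execute randomly assigned ideas that come from either human experts or an AI (Claude-3.5-Sonnet).
Participants spend about 103 hours on average executing their idea and write a 4-page paper documenting the experiments.
The idea source is blinded and randomized, and execution follows standardized guidelines within a three-month window.
Expert reviewers (n=58) review the executed projects blindly, using a rubric that covers novelty, excitement, soundness, effectiveness, overall score, and faithfulness.
Ideation scores are taken from the prior study, and each idea's post-execution score is compared with its pre-execution score to measure the ideation-execution gap.
The study design is preregistered and the data are publicly released.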
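
The gap measurement described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the scores are invented toy values, the helper names `score_gaps` and `permutation_test` are hypothetical, and the permutation test is a stand-in chosen for simplicity, since this review does not say which significance test the paper used.

```python
import random
from statistics import mean


def score_gaps(ideation, execution):
    """Per-idea gap: execution score minus ideation score (negative = drop)."""
    return [e - i for i, e in zip(ideation, execution)]


def permutation_test(gaps_a, gaps_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in mean gap.

    Assumption: used here only as a simple stand-in for whatever test
    the paper actually applied.
    """
    rng = random.Random(seed)
    observed = mean(gaps_a) - mean(gaps_b)
    pooled = list(gaps_a) + list(gaps_b)
    n_a = len(gaps_a)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_resamples + 1)


# Toy scores on a 1-10 scale, invented purely for illustration.
ai_ideation, ai_execution = [6.5, 6.0, 7.0, 6.5], [4.5, 4.0, 5.5, 4.0]
human_ideation, human_execution = [5.5, 5.0, 6.0, 5.5], [5.0, 4.5, 5.5, 5.5]

ai_gaps = score_gaps(ai_ideation, ai_execution)
human_gaps = score_gaps(human_ideation, human_execution)
observed, p_value = permutation_test(ai_gaps, human_gaps)

print(f"mean AI gap = {mean(ai_gaps):.3f}")
print(f"mean human gap = {mean(human_gaps):.3f}")
print(f"difference = {observed:.3f}, permutation p = {p_value:.3f}")
```

A more negative mean gap for the AI condition than for the human condition is the pattern the study reports. With real data, the same computation would be repeated once per rubric metric.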

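EXPERIMENTAL RESULTS

Research questions

RQ1: Do AI-generated ideas lead to better execution outcomes than human-generated ideas?
RQ2: How do ideation-stage and execution-stage evaluations compare for AI versus human ideas?
RQ3: How large is the ideation-execution gap for AI ideas relative to human ideas?
RQ4: Which factors do reviewers weigh more heavily in execution evaluation that do not surface in ideation evaluation?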

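Key results

AI ideas score higher than human ideas in the ideation (pre-execution) evaluation, but after execution they drop more on novelty, excitement, effectiveness, and overall score (p<0.05).
In the execution evaluation, human ideas retain their scores better than AI ideas, so the gap narrows or the ranking flips on several metrics.
Comparing ideation scores against execution scores, AI ideas show larger drops on novelty, excitement, effectiveness, and overall score, and the difference from human ideas is statistically significant (FDR-corrected p-values are reported).
Some AI-generated ideas rank below human ideas on certain metrics after execution, although these differences are not always statistically significant given the sample size.
Execution-stage reviewers take empirical performance and experimental rigor into account, and they often point out weaknesses that are not visible at the ideation stage.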
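
Because the same comparison is run on several metrics, the reported p-values are FDR-corrected. The sketch below shows one common way to do this, assuming the Benjamini-Hochberg procedure; this review does not name the exact procedure the paper used, and the raw p-values here are made up for illustration. The function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Invented raw p-values, one per metric (not results from the paper).
raw = {"novelty": 0.01, "excitement": 0.04, "effectiveness": 0.03, "overall": 0.20}
adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))

for metric, p_adj in adjusted.items():
    print(f"{metric}: raw p = {raw[metric]:.3f}, BH-adjusted p = {p_adj:.3f}")
```

A metric counts as significant at a 5% false discovery rate when its adjusted p-value is at most 0.05.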


This review was generated by AI and checked by a human editor.