QUICK REVIEW

[Paper Digest] Assessing Reproducibility in Evolutionary Computation: A Case Study using Human- and LLM-based Assessment

Francesca Da Ros, Tarik Zaciragic | arXiv (Cornell University) | Feb 5, 2026

Scientific Computing and Data Management | Cited by 0

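One-Sentence Summary

In brief: this paper analyzes reproducibility practices in papers from the past ten years of the Evolutionary Combinatorial Optimization and Metaheuristics (ECOM) track at GECCO, using a manual checklist and the LLM-based RECAP pipeline. It reports an average reproducibility completeness of 0.62, an artifact provision rate of 36.90%, and substantial agreement between automated and human assessment (κ = 0.67).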

ABSTRACT

Reproducibility is an important requirement in evolutionary computation, where results largely depend on computational experiments. In practice, reproducibility relies on how algorithms, experimental protocols, and artifacts are documented and shared. Despite growing awareness, there is still limited empirical evidence on the actual reproducibility levels of published work in the field. In this paper, we study the reproducibility practices in papers published in the Evolutionary Combinatorial Optimization and Metaheuristics track of the Genetic and Evolutionary Computation Conference over a ten-year period. We introduce a structured reproducibility checklist and apply it through a systematic manual assessment of the selected corpus. In addition, we propose RECAP (REproducibility Checklist Automation Pipeline), an LLM-based system that automatically evaluates reproducibility signals from paper text and associated code repositories. Our analysis shows that papers achieve an average completeness score of 0.62, and that 36.90% of them provide additional material beyond the manuscript itself. We demonstrate that automated assessment is feasible: RECAP achieves substantial agreement with human evaluators (Cohen's κ of 0.67). Together, these results highlight persistent gaps in reproducibility reporting and suggest that automated tools can effectively support large-scale, systematic monitoring of reproducibility practices.

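Motivation and Objectives

Assess how complete reproducibility reporting is in ECOM papers.
Identify which reproducibility items are consistently reported and which are systematically omitted.
Analyze how reproducibility artifacts are provided, maintained, and kept accessible over time.
Measure how closely the automated RECAP assessment agrees with human judgment.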

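Proposed Method

Develop a structured reproducibility checklist for EC experiments, based on ACM standards.
Manually assess 168 ECOM papers from 2016–2025, extracting artifacts and evaluating the feasibility of re-running them.
Design RECAP, an LLM-based pipeline that uses the checklist to automatically assess reproducibility signals in paper text and code repositories.
Process PDF text, extract artifacts, and run code in a sandbox to assess functional executability where feasible.
Compare automated RECAP results with the manual assessment to measure agreement and identify disagreements.
Ground the assessment in ACM artifact guidelines and open-science practices while keeping it interpretable.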
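
To make the idea of a checklist-based completeness score concrete, here is a minimal sketch. This digest does not give the paper's exact scoring rule, so the sketch assumes the score is the fraction of applicable checklist items that a paper reports. The item names and the `None` convention for "not applicable" are hypothetical and chosen only for illustration.

```python
from typing import Optional


def completeness(checklist: dict[str, Optional[bool]]) -> float:
    """Return the fraction of applicable checklist items that are reported.

    Assumed convention (not taken from the paper):
    True = reported, False = missing, None = not applicable.
    """
    applicable = [value for value in checklist.values() if value is not None]
    if not applicable:
        raise ValueError("no applicable checklist items")
    # bool is a subclass of int, so sum() counts the True entries.
    return sum(applicable) / len(applicable)


# Hypothetical checklist for a single paper.
paper = {
    "algorithm_described": True,
    "parameters_reported": True,
    "instances_available": False,
    "code_available": False,
    "random_seeds_reported": None,  # treated as not applicable
    "statistical_test_reported": True,
}

score = completeness(paper)
print(f"completeness = {score:.2f}")
```

Under this assumption, three of the five applicable items are reported, so the example paper scores 0.60. Averaging such per-paper scores over a corpus would give a corpus-level figure comparable in form to the 0.62 reported above.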

Figure 1. BPMN model of the manual assessment of a paper.

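Experimental Results

Research Questions

RQ1: How does the overall reproducibility completeness of ECOM papers vary across years?
RQ2: Which reproducibility items are reported most consistently, and which are systematically omitted?
RQ3: How do ECOM papers provide reproducibility artifacts, and how stable are these practices?
RQ4: Do best-paper candidates differ from other papers in their reproducibility reporting patterns?
RQ5: To what extent can RECAP reproduce human judgment?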

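Key Findings

The median paper-level reproducibility completeness is 0.61, with an overall upward trend after 2020.
Only 3 of the 168 papers (1.79%) provide enough material for a full reproduction without contacting the authors.
36.90% of papers provide at least one reproducibility artifact; artifact provision varies over time, and links are sometimes unavailable.
Only 15% (25/168) use automated parameter tuning; nearly half report statistical tests, but only 13.69% apply a multiple-testing correction.
Best-paper candidates show slightly higher reproducibility completeness, but the difference is not statistically significant; artifact availability is higher among nominees and winners.
RECAP reaches a mean per-paper accuracy of 76.8% and a Cohen's κ of 0.67 against the human assessment, indicating substantial agreement.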
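
Cohen's κ corrects raw agreement for agreement expected by chance, which is why it can sit below the raw accuracy figure. With observed agreement $p_o$ and chance agreement $p_e$, it is defined as $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below computes it for two raters; the label lists are made-up illustration data, not values from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items where both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Made-up labels for ten checklist fields (illustration only).
human = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "no"]
recap = ["yes", "yes", "no", "yes", "yes", "no", "yes", "no", "no", "no"]

kappa = cohens_kappa(human, recap)
print(f"kappa = {kappa:.2f}")
```

In this toy example the raters agree on 8 of 10 items ($p_o = 0.8$) and chance agreement is $p_e = 0.5$, giving κ of about 0.60. On the commonly used Landis and Koch scale, values between 0.61 and 0.80 are labelled "substantial", which matches the wording used for the reported 0.67.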

Figure 2. RECAP system overview. The system processes each paper through a field-by-field evaluation loop. Based on field type, it either uses the paper text directly (Std), retrieves cached best paper website data (BP), or processes linked repositories (Art). Each field is evaluated by an LLM with
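
The field-by-field loop described in the Figure 2 caption can be sketched as a simple dispatch on field type. This is only an illustration of the control flow, not RECAP's implementation: the function names, the dictionary layout of `paper`, and the field names are all hypothetical, and a trivial keyword matcher stands in for the LLM call.

```python
from typing import Callable

# Field types named in the Figure 2 caption.
STD, BP, ART = "Std", "BP", "Art"


def select_evidence(field_type: str, paper: dict) -> str:
    """Pick the evidence source for a field (hypothetical data layout)."""
    if field_type == STD:
        return paper["text"]                           # paper text used directly
    if field_type == BP:
        return paper.get("best_paper_cache", "")       # cached best-paper website data
    if field_type == ART:
        return "\n".join(paper.get("repositories", []))  # linked repository contents
    raise ValueError(f"unknown field type: {field_type}")


def evaluate_paper(
    paper: dict,
    fields: dict[str, str],
    judge: Callable[[str, str], bool],
) -> dict[str, bool]:
    """Evaluate every checklist field with the supplied judge function."""
    return {
        name: judge(name, select_evidence(field_type, paper))
        for name, field_type in fields.items()
    }


def keyword_judge(field_name: str, evidence: str) -> bool:
    """Stand-in for the LLM: checks whether the field name appears in the evidence."""
    return field_name.replace("_", " ") in evidence.lower()


# Hypothetical paper record and checklist fields.
paper = {
    "text": "We report random seeds and parameter settings.",
    "best_paper_cache": "",
    "repositories": ["README: source code and instances"],
}
fields = {"random_seeds": STD, "best_paper_nominee": BP, "source_code": ART}

results = evaluate_paper(paper, fields, keyword_judge)
print(results)
```

Passing the judge in as a function keeps the dispatch logic testable without any model access; in a real pipeline that argument would wrap the LLM call.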


This digest was generated by AI and reviewed by a human editor.