QUICK REVIEW

[Paper Review] Assessing Reproducibility in Evolutionary Computation: A Case Study using Human- and LLM-based Assessment

Francesca Da Ros, Tarik Zaciragic|arXiv (Cornell University)|Feb 5, 2026

Scientific Computing and Data Management | Citations: 0

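One-Line Summary

Using a manual checklist and the LLM-based RECAP pipeline, the paper analyzes ten years of reproducibility practices in the Evolutionary Combinatorial Optimization and Metaheuristics (ECOM) track of GECCO. It reports an average reproducibility completeness score of 0.62, an artifact provision rate of 36.90%, and substantial agreement between automated and human assessment (κ = 0.67).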

ABSTRACT

Reproducibility is an important requirement in evolutionary computation, where results largely depend on computational experiments. In practice, reproducibility relies on how algorithms, experimental protocols, and artifacts are documented and shared. Despite growing awareness, there is still limited empirical evidence on the actual reproducibility levels of published work in the field. In this paper, we study the reproducibility practices in papers published in the Evolutionary Combinatorial Optimization and Metaheuristics track of the Genetic and Evolutionary Computation Conference over a ten-year period. We introduce a structured reproducibility checklist and apply it through a systematic manual assessment of the selected corpus. In addition, we propose RECAP (REproducibility Checklist Automation Pipeline), an LLM-based system that automatically evaluates reproducibility signals from paper text and associated code repositories. Our analysis shows that papers achieve an average completeness score of 0.62, and that 36.90% of them provide additional material beyond the manuscript itself. We demonstrate that automated assessment is feasible: RECAP achieves substantial agreement with human evaluators (Cohen's κ of 0.67). Together, these results highlight persistent gaps in reproducibility reporting and suggest that automated tools can effectively support large-scale, systematic monitoring of reproducibility practices.

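Motivation and Objectives

Assess the completeness of reproducibility reporting in ECOM papers.
Identify which reproducibility items are consistently reported and which are systematically omitted.
Analyze how reproducibility artifacts are provided and maintained, and whether they remain accessible over time.
Evaluate how closely automated RECAP assessments agree with human judgment.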

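Proposed Method

Develop a structured reproducibility checklist for EC experiments, based on ACM standards.
Apply it manually to 168 ECOM papers from 2016–2025, extracting artifacts and assessing whether they can be re-run.
Design RECAP, an LLM-based pipeline that uses the checklist to automatically evaluate reproducibility signals from paper text and code repositories.
Process the PDF text, extract artifacts, and run the code in a sandbox where it appears executable.
Compare the automated RECAP results with the manual assessment to measure agreement and identify discrepancies.
Keep the assessment aligned with ACM artifact guidelines and open-science practices while preserving interpretability.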
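To make the idea of a checklist-based completeness score concrete, here is a minimal Python sketch. This review does not state the paper's exact scoring formula, so the rule used here (the share of applicable items that are reported) is an assumption, and the item names in `example` are hypothetical placeholders rather than the authors' actual checklist fields.

```python
from typing import Dict, Optional


def completeness(checklist: Dict[str, Optional[bool]]) -> float:
    """Return the fraction of applicable checklist items that are reported.

    True  = item is reported in the paper
    False = item is missing
    None  = item does not apply to this paper (excluded from the score)

    Assumption: this scoring rule is illustrative, not the paper's definition.
    """
    applicable = [value for value in checklist.values() if value is not None]
    if not applicable:
        raise ValueError("no applicable checklist items")
    return sum(applicable) / len(applicable)


# Hypothetical item names, chosen only for illustration.
example = {
    "algorithm_description": True,
    "parameter_settings": True,
    "random_seeds": False,
    "instance_availability": True,
    "statistical_tests": False,
    "source_code_link": None,  # treated as not applicable here
}

score = completeness(example)
print(round(score, 2))  # 0.6 (3 of 5 applicable items reported)
```

Excluding non-applicable items from the denominator is one possible design choice; counting them as missing would give lower scores for papers where some items cannot apply.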

Figure 1. BPMN model of the manual assessment of a paper.

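Experimental Results

Research Questions

RQ1: How does the overall reproducibility completeness of ECOM papers change from year to year?
RQ2: Which reproducibility items are most consistently reported, and which are systematically omitted?
RQ3: How do ECOM papers provide reproducibility artifacts, and how stable are these practices?
RQ4: Do best-paper nominees differ from other papers in their reproducibility reporting patterns?
RQ5: To what extent can RECAP reproduce human judgment?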

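Key Findings

The median paper-level reproducibility completeness is 0.61, with an overall upward trend since 2020.
Only 3 of the 168 papers (1.79%) provide enough material for full reproduction without contacting the authors.
36.90% of papers provide at least one reproducibility artifact; artifact provision varies over time, and links sometimes become unavailable.
About 15% of papers (25 of 168) use automated parameter tuning. Nearly half report statistical tests, but only 13.69% apply a multiple-testing correction.
Best-paper nominees tend to have slightly higher completeness, but the difference is not statistically significant; artifact provision is higher among nominees and winners.
RECAP achieves an average per-paper accuracy of 76.8%, and its Cohen's κ with human assessment is 0.67, indicating substantial agreement.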
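The agreement figure above is a Cohen's κ, which corrects raw agreement for the agreement expected by chance. The sketch below computes it from two label sequences using only the Python standard library. The labels are synthetic and invented for illustration, not data from the paper, and the function is a generic textbook implementation rather than the authors' evaluation code.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected if both raters labelled at random
    according to their own label frequencies.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    labels = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Synthetic checklist judgments for ten items (not taken from the paper).
human = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "no"]
recap = ["yes", "yes", "no", "yes", "yes", "no", "yes", "no", "no", "no"]

print(round(cohens_kappa(human, recap), 2))  # 0.6 (p_o = 0.8, p_e = 0.5)
```

On the commonly used Landis and Koch scale, values from 0.61 to 0.80 are described as substantial agreement, which matches how the reported κ of 0.67 is characterized.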

Figure 2. RECAP system overview. The system processes each paper through a field-by-field evaluation loop. Based on field type, it either uses the paper text directly (Std), retrieves cached best paper website data (BP), or processes linked repositories (Art). Each field is evaluated by an LLM with


This review was generated by AI and checked by a human editor.