QUICK REVIEW

[Paper Review] An Audit of Machine Learning Experiments on Software Defect Prediction

Giuseppe Destefanis, Leila Yousefi|arXiv (Cornell University)|Jan 26, 2026

Software Engineering Research|Citations: 0

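One-Sentence Summary

This paper audits the experimental design, analysis, reporting, and reproducibility of recent software defect prediction (SDP) experiments published between 2019 and 2023, sampling 101 papers and revealing wide variation in practice along with gaps in reproducibility.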

ABSTRACT

Background: Machine learning algorithms are widely used to predict defect prone software components. In this literature, computational experiments are the main means of evaluation, and the credibility of results depends on experimental design and reporting. Objective: This paper audits recent software defect prediction (SDP) studies by assessing their experimental design, analysis, and reporting practices against accepted norms from statistics, machine learning, and empirical software engineering. The aim is to characterise current practice and assess the reproducibility of published results. Method: We audited SDP studies indexed in SCOPUS between 2019 and 2023, focusing on design and analysis choices such as outcome measures, out of sample validation strategies, and the use of statistical inference. Nine study issues were evaluated. Reproducibility was assessed using the instrument proposed by González Barahona and Robles. Results: The search identified approximately 1,585 SDP experiments published during the period. From these, we randomly sampled 101 papers, including 61 journal and 40 conference publications, with almost 50 percent behind paywalls. We observed substantial variation in research practice. The number of datasets ranged from 1 to 365, learners or learner variants from 1 to 34, and performance measures from 1 to 9. About 45 percent of studies applied formal statistical inference. Across the sample, we identified 427 issues, with a median of four per paper, and only one paper without issues. Reproducibility ranged from near complete to severely limited. We also identified two cases of tortured phrases and possible paper mill activity. Conclusions: Experimental design and reporting practices vary widely, and almost half of the studies provide insufficient detail to support reproduction. The audit indicates substantial scope for improvement.

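Motivation and Objectives

Assess the current state of SDP experimental design, analysis, and reporting against established norms.
Characterise the bibliometric and methodological landscape of SDP experiments from 2019–2023.
Assess the reproducibility prospects of SDP studies using an established instrument.
Identify common quality issues and reporting gaps in order to encourage improvement within the SDP research community.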

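Proposed Method

Conduct a systematic audit of SDP experiments indexed in Scopus (2019–2023).
Select 101 papers using a stratified random sample (61 journal papers and 40 conference papers).
Apply the reproducibility instrument of González-Barahona and Robles (27 yes/no indicators in 5 categories) to obtain a reproducibility score between 0 and 1.
Define and evaluate nine study issues, divided into experimental design/implementation issues and reporting issues.
Extract data on bibliometrics, experimental design (datasets, learners, performance measures), benchmarks, statistics, and reporting.
Provide the analysis and reproduction materials through an RMarkdown notebook and a Zenodo repository.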
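
The 0–1 reproducibility score mentioned above can be illustrated with a minimal sketch. It assumes the score is simply the fraction of the 27 yes/no indicators answered "yes"; this review does not state the aggregation rule actually used in the audit, so the equal weighting is an assumption. The category names and the number of indicators per category are hypothetical placeholders.

```python
from typing import Dict, List


def reproducibility_score(answers: Dict[str, List[bool]]) -> float:
    """Return the fraction of 'yes' answers across all categories (0.0 to 1.0).

    Assumption: every indicator carries equal weight. The audit's real
    aggregation rule may differ.
    """
    flat = [a for indicators in answers.values() for a in indicators]
    if not flat:
        raise ValueError("no indicators supplied")
    return sum(flat) / len(flat)


# Hypothetical paper: 5 placeholder categories totalling 27 indicators.
example = {
    "category_a": [True] * 5 + [False] * 1,  # 6 indicators
    "category_b": [True] * 3 + [False] * 3,  # 6 indicators
    "category_c": [True] * 2 + [False] * 3,  # 5 indicators
    "category_d": [False] * 5,               # 5 indicators
    "category_e": [True] * 4 + [False] * 1,  # 5 indicators
}

score = reproducibility_score(example)
print(f"reproducibility score: {score:.2f}")
```

Under this assumed rule, a paper answering "yes" to every indicator scores 1.0 and a paper answering "no" to every indicator scores 0.0, which matches the range from near complete to severely limited described in the abstract.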

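Experimental Results

Research Questions

RQ1: How can SDP research be characterised using bibliographic data?
RQ2: What experimental design approaches are used in SDP research?
RQ3: How reproducible is SDP research?
RQ4: What quality issues are found in SDP research?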

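Key Findings

Approximately 1,585 SDP experiments indexed in Scopus were published in 2019–2023; the audit sampled 101 of these papers (about 6.4%).
The 101 papers comprise 61 journal and 40 conference publications from 74 unique sources; paper length ranges from 3 to 46 pages (median 12).
About 50% are behind paywalls, and green/open access availability is uneven (Gold/Diamond only: about 29; Green: about 23; Green not available: about 41; No: about 8).
Citation counts vary widely; the median across all years is 1.2 citations per year, and 25 papers have no citations.
About 45% of papers applied formal statistical inference, and 427 issues were identified across the 101 papers (median 4 per paper, with only one paper free of issues).
Reproducibility ranged from near complete to lacking almost all of the necessary information, indicating a large reproducibility gap.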
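
The headline figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers reported in this review: the mean number of issues per paper is derived here and is not a figure given by the source, which reports the median.

```python
# Figures as reported in this review.
population = 1585   # approximate number of SDP experiments found
sample = 101        # papers audited
total_issues = 427  # issues identified across the sample

sampling_fraction = sample / population
mean_issues = total_issues / sample  # derived here, not reported by the source

# Access categories as listed above (approximate counts).
access = {
    "Gold/Diamond only": 29,
    "Green": 23,
    "Green not available": 41,
    "No": 8,
}

print(f"sampling fraction: {sampling_fraction:.1%}")
print(f"mean issues per paper: {mean_issues:.2f}")
print(f"access categories sum: {sum(access.values())}")
```

The access categories sum to the sample size of 101, and the derived mean of roughly 4.2 issues per paper sits close to the reported median of 4.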

Figure 2: Distribution of paper length by page count

This review was generated by AI and checked by a human editor.