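[Paper Explainer] An Audit of Machine Learning Experiments on Software Defect Prediction
This paper audits recent software defect prediction (SDP) experiments (2019–2023) for experimental design, analysis, reporting, and reproducibility. It samples 101 papers and finds wide variation in practice and clear gaps in reproducibility.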
Background: Machine learning algorithms are widely used to predict defect prone software components. In this literature, computational experiments are the main means of evaluation, and the credibility of results depends on experimental design and reporting. Objective: This paper audits recent software defect prediction (SDP) studies by assessing their experimental design, analysis, and reporting practices against accepted norms from statistics, machine learning, and empirical software engineering. The aim is to characterise current practice and assess the reproducibility of published results. Method: We audited SDP studies indexed in SCOPUS between 2019 and 2023, focusing on design and analysis choices such as outcome measures, out of sample validation strategies, and the use of statistical inference. Nine study issues were evaluated. Reproducibility was assessed using the instrument proposed by González Barahona and Robles. Results: The search identified approximately 1,585 SDP experiments published during the period. From these, we randomly sampled 101 papers, including 61 journal and 40 conference publications, with almost 50 percent behind paywalls. We observed substantial variation in research practice. The number of datasets ranged from 1 to 365, learners or learner variants from 1 to 34, and performance measures from 1 to 9. About 45 percent of studies applied formal statistical inference. Across the sample, we identified 427 issues, with a median of four per paper, and only one paper without issues. Reproducibility ranged from near complete to severely limited. We also identified two cases of tortured phrases and possible paper mill activity. Conclusions: Experimental design and reporting practices vary widely, and almost half of the studies provide insufficient detail to support reproduction. The audit indicates substantial scope for improvement.
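Motivation and Objectives
- Assess how the design, analysis, and reporting of SDP experiments compare with accepted norms.
- Describe the bibliometric and methodological landscape of SDP experiments published in 2019–2023.
- Assess the reproducibility prospects of SDP studies using an established instrument.
- Identify common quality issues and reporting gaps to encourage improvement in the SDP research community.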
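Proposed Method
- Conduct a systematic audit of SDP experiments indexed in Scopus (2019–2023).
- Select 101 papers (61 journal, 40 conference) by stratified random sampling.
- Adapt the reproducibility instrument of González-Barahona and Robles (27 yes/no indicators in 5 categories) to yield a reproducibility score between 0 and 1.
- Define and evaluate nine study issues, divided into two groups: experimental design/implementation and reporting.
- Extract data on bibliometrics, experimental design (datasets, learners, performance measures), benchmarks, statistics, and reporting.
- Provide the analysis and replication materials through an RMarkdown notebook and a Zenodo repository.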
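One way to picture how 27 yes/no indicators can be collapsed into a 0–1 reproducibility score is sketched below. This is a minimal illustration, not the authors' scoring code: the unweighted "fraction of yes answers" aggregation, the function name `reproducibility_score`, the placeholder category names, and the way the 27 indicators are split across the 5 categories are all assumptions made for the example.

```python
from typing import Dict, List


def reproducibility_score(answers: Dict[str, List[bool]]) -> float:
    """Return the fraction of 'yes' answers over all indicators.

    Assumption: every indicator carries equal weight. The audited
    paper may aggregate differently (for example, per category).
    """
    flat = [a for indicators in answers.values() for a in indicators]
    if not flat:
        raise ValueError("no indicators supplied")
    return sum(flat) / len(flat)


# Hypothetical assessment of one paper. Category names and the split of
# the 27 indicators over the 5 categories are placeholders.
example = {
    "category_a": [True] * 5 + [False] * 1,  # 6 indicators
    "category_b": [True] * 4 + [False] * 2,  # 6 indicators
    "category_c": [True] * 3 + [False] * 2,  # 5 indicators
    "category_d": [False] * 5,               # 5 indicators
    "category_e": [True] * 2 + [False] * 3,  # 5 indicators
}

score = reproducibility_score(example)
print(f"{score:.2f}")  # 14 of 27 indicators answered 'yes' -> 0.52
```

Under this assumed scheme, a score of 1 means every indicator was satisfied and a score of 0 means none was, which matches the "near complete to severely limited" range described in the abstract.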

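Experimental Results
Research Questions
- RQ1: How can bibliometric data be used to characterise SDP research?
- RQ2: Which experimental design approaches are used in SDP studies?
- RQ3: How reproducible are SDP studies?
- RQ4: Which quality issues are found in SDP studies?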
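Key Findings
- Approximately 1,585 SDP experiments published in 2019–2023 are indexed in Scopus; the audit sampled 101 of these papers (about 6.4%).
- The 101 papers comprise 61 journal articles and 40 conference papers from 74 unique venues; paper length ranges from 3 to 46 pages (median 12).
- About 50% of the papers are behind paywalls, and open-access availability is uneven (approximately 29 Gold/Diamond, 23 Green, 41 Green unavailable, 8 "No").
- Citation counts vary widely; the median number of citations per year across all years is 1.2, and 25 papers have no citations.
- About 45% of the papers use formal statistical inference; 427 issues were found across the 101 papers (median 4 per paper; only one paper had no issues).
- Reproducibility ranges from near perfect to missing almost all required information, indicating substantial reproducibility gaps.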
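The headline figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers reported in this summary; the variable names are illustrative, and the mean number of issues per paper is derived here for comparison with the reported median, not a figure stated by the authors.

```python
# Figures as reported in the summary (the population size is approximate).
population = 1585            # SDP experiments indexed in Scopus, 2019-2023
journal, conference = 61, 40
total_issues = 427

sample = journal + conference
sampling_fraction = sample / population
mean_issues = total_issues / sample  # derived; the paper reports a median of 4

# Approximate open-access breakdown, labels copied as given in the summary.
oa_counts = {"Gold/Diamond": 29, "Green": 23, "Green unavailable": 41, "No": 8}

print(f"sample size: {sample}")                        # sample size: 101
print(f"sampling fraction: {sampling_fraction:.1%}")   # sampling fraction: 6.4%
print(f"mean issues per paper: {mean_issues:.2f}")     # mean issues per paper: 4.23
print(f"open-access categories sum to: {sum(oa_counts.values())}")  # 101
```

The derived mean (about 4.23) sits slightly above the reported median of 4, which is consistent with a small number of papers carrying more issues than the typical one.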

This explainer was generated by AI and reviewed by a human editor.