QUICK REVIEW

[Paper Review] Leakage and the Reproducibility Crisis in ML-based Science

Sayash Kapoor, Arvind Narayanan|arXiv (Cornell University)|Jul 14, 2022

Explainable Artificial Intelligence (XAI)|Cited by 139

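One-Sentence Summary

This paper examines reproducibility failures in ML-based science caused by data leakage across 17 fields, presents a fine-grained taxonomy of leakage, and proposes model info sheets for detecting it. A case study on civil war prediction shows that, once leakage is corrected, complex ML models do not perform substantively better than logistic regression.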

ABSTRACT

The use of machine learning (ML) methods for prediction and forecasting has become widespread across the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. In this paper, we systematically investigate reproducibility issues in ML-based science. We show that data leakage is indeed a widespread problem and has led to severe reproducibility failures. Specifically, through a survey of literature in research communities that adopted ML methods, we find 17 fields where errors have been found, collectively affecting 329 papers and in some cases leading to wildly overoptimistic conclusions. Based on our survey, we present a fine-grained taxonomy of 8 types of leakage that range from textbook errors to open research problems. We argue for fundamental methodological changes to ML-based science so that cases of leakage can be caught before publication. To that end, we propose model info sheets for reporting scientific claims based on ML models that would address all types of leakage identified in our survey. To investigate the impact of reproducibility errors and the efficacy of model info sheets, we undertake a reproducibility study in a field where complex ML models are believed to vastly outperform older statistical models such as Logistic Regression (LR): civil war prediction. We find that all papers claiming the superior performance of complex ML models compared to LR models fail to reproduce due to data leakage, and complex ML models don't perform substantively better than decades-old LR models. While none of these errors could have been caught by reading the papers, model info sheets would enable the detection of leakage in each case.

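Research Motivation and Objectives

Show that data leakage is a widespread driver of irreproducible results in ML-based science.
Provide a fine-grained taxonomy of the types of data leakage that bear on scientific claims.
Propose model info sheets as a tool for detecting and preventing leakage in reports of ML-based science.
Empirically assess the impact of leakage through a case study on civil war prediction.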

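Proposed Method

Conduct a systematic literature survey of 20 papers across 17 fields to identify leakage-related pitfalls and quantify the affected work.
Present a fine-grained taxonomy of 8 types of leakage spanning data collection, preprocessing, modeling, and evaluation.
Propose model info sheets as a reporting tool that requires authors to state their leakage-related arguments explicitly (train-test separation, feature legitimacy, distribution match).
Run a reproducibility study in civil war prediction, re-analyzing 12 papers that used a train-test split and had code and data available, and correcting their leakage errors.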
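
To make the train-test separation argument concrete, here is a minimal sketch of one textbook leakage pattern: fitting a preprocessing step (mean imputation) on the combined train and test data. The toy data and the helper names `fit_mean_imputer` and `apply_imputer` are hypothetical and chosen for illustration; this is not the authors' code, and it is not meant to reproduce the error in any specific paper they surveyed.

```python
from statistics import mean


def fit_mean_imputer(rows):
    """Return per-column means computed only from the observed values in `rows`."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(mean(observed) if observed else 0.0)
    return means


def apply_imputer(rows, means):
    """Fill each missing value (None) with the precomputed mean of its column."""
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


# Hypothetical toy data: None marks a missing value.
train = [[1.0, None], [3.0, 10.0], [None, 14.0]]
test = [[100.0, None], [None, 50.0]]

# Leak-free: imputation statistics come from the training rows only.
train_means = fit_mean_imputer(train)
test_clean = apply_imputer(test, train_means)

# Leaky: statistics are computed on train + test, so test values shape
# the features the model is trained on.
leaky_means = fit_mean_imputer(train + test)
train_leaky = apply_imputer(train, leaky_means)

print(train_means)  # [2.0, 12.0]
print(test_clean)   # [[100.0, 12.0], [2.0, 50.0]]
print(train_leaky[0][1] != apply_imputer(train, train_means)[0][1])  # True
```

In the leaky variant the imputed training features already depend on the test rows, so a score measured on that test set no longer estimates performance on unseen data. The same "fit on train, apply to test" discipline applies to scaling, feature selection, and other fitted preprocessing steps.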

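Experimental Results

Research Questions

RQ1: How prevalent is data leakage as a cause of irreproducible results in ML-based science across disciplines?
RQ2: What distinct types of leakage affect claims in ML-based science, and how can they be detected and mitigated?
RQ3: Can model info sheets reliably surface or prevent leakage across fields, as illustrated by the civil war prediction case study?
RQ4: Once leakage is addressed, do complex ML models still hold a substantive advantage over logistic regression?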

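Key Findings

Data leakage is widespread: errors were found in 17 fields, collectively affecting 329 papers.
The authors identify 8 types of leakage, ranging from textbook errors to distribution mismatch and non-independence of training and test samples.
Mitigations drawn from engineering settings and modeling competitions do not translate directly to ML-based science.
Model info sheets can detect leakage and are needed because reading the papers alone does not reveal it.
In civil war prediction, the papers claiming that complex ML models outperform logistic regression fail to reproduce; after leakage is corrected, the complex models are not substantively better.
Papers that compare ML models often lack uncertainty quantification and significance tests.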
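
As an illustration of the last finding, the sketch below shows one common way to attach uncertainty to a model comparison: a paired percentile bootstrap for the difference in accuracy between two models on a shared test set. The function name `paired_bootstrap_diff`, its parameters, and the toy labels are assumptions made for this example; the paper does not prescribe this particular procedure.

```python
import random


def paired_bootstrap_diff(y_true, pred_a, pred_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy(A) - accuracy(B) on one test set."""
    rng = random.Random(seed)
    n = len(y_true)
    # Per-example difference in correctness: +1, 0, or -1.
    d = [int(a == y) - int(b == y) for y, a, b in zip(y_true, pred_a, pred_b)]
    observed = sum(d) / n
    stats = []
    for _ in range(n_resamples):
        # Resample test examples with replacement, keeping A and B paired.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(sum(d[i] for i in idx) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return observed, lo, hi


# Hypothetical toy predictions from a "complex" model A and a baseline B.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
pred_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
pred_b = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

observed, lo, hi = paired_bootstrap_diff(y_true, pred_a, pred_b)
print(f"accuracy difference = {observed:.2f}, 95% interval = [{lo:.2f}, {hi:.2f}]")
```

If the interval includes zero, the test set gives no clear evidence that one model is better than the other. Resampling examples in pairs matters here because both models are scored on the same test items, so their errors are correlated.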


This review was generated by AI and reviewed by a human editor.