QUICK REVIEW

[Paper Review] Leakage and the Reproducibility Crisis in ML-based Science

Sayash Kapoor, Arvind Narayanan | arXiv (Cornell University) | Jul 14, 2022

Explainable Artificial Intelligence (XAI) | Cited by 139

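One-Line Summary

This paper traces the lack of reproducibility in ML-based science to data leakage found across 17 fields, introduces a fine-grained taxonomy of leakage, and proposes model info sheets for detecting it. The argument is backed by a civil war prediction case study showing that, once leakage is corrected, complex ML models do not perform substantively better than logistic regression.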

ABSTRACT

The use of machine learning (ML) methods for prediction and forecasting has become widespread across the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. In this paper, we systematically investigate reproducibility issues in ML-based science. We show that data leakage is indeed a widespread problem and has led to severe reproducibility failures. Specifically, through a survey of literature in research communities that adopted ML methods, we find 17 fields where errors have been found, collectively affecting 329 papers and in some cases leading to wildly overoptimistic conclusions. Based on our survey, we present a fine-grained taxonomy of 8 types of leakage that range from textbook errors to open research problems. We argue for fundamental methodological changes to ML-based science so that cases of leakage can be caught before publication. To that end, we propose model info sheets for reporting scientific claims based on ML models that would address all types of leakage identified in our survey. To investigate the impact of reproducibility errors and the efficacy of model info sheets, we undertake a reproducibility study in a field where complex ML models are believed to vastly outperform older statistical models such as Logistic Regression (LR): civil war prediction. We find that all papers claiming the superior performance of complex ML models compared to LR models fail to reproduce due to data leakage, and complex ML models don't perform substantively better than decades-old LR models. While none of these errors could have been caught by reading the papers, model info sheets would enable the detection of leakage in each case.

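Motivation and Objectives

Show that data leakage is a widespread cause of irreproducible results in ML-based science.
Provide a fine-grained taxonomy of the types of data leakage that affect scientific claims.
Propose model info sheets for detecting and preventing leakage in reports of ML-based science.
Empirically assess the impact of leakage through a civil war prediction case study.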

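Proposed Method

Conduct a systematic literature survey covering 20 papers across 17 fields to identify leakage-related pitfalls and quantify the affected studies.
Develop a fine-grained taxonomy of 8 leakage types spanning data collection, preprocessing, modeling, and evaluation.
Propose model info sheets as a reporting tool that requires authors to make explicit, leakage-focused claims (train-test separation, legitimacy of features, and a match between the test distribution and the distribution the scientific claim is about).
Carry out a reproducibility study of civil war prediction by reanalyzing 12 papers that used a train-test split and had code and data available, correcting the leakage errors found.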
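The three claims that a model info sheet asks authors to justify can be pictured as a small record plus a mechanical check. The sketch below is only an illustration under stated assumptions: the class `ModelInfoSheet`, its field names, and the helpers `missing_justifications` and `overlapping_ids` are hypothetical and are not the authors' template. The overlap check covers just one narrow kind of train-test contamination (identical record identifiers in both sets).

```python
from dataclasses import dataclass, fields


@dataclass
class ModelInfoSheet:
    """Hypothetical record: each field holds the author's written justification."""
    train_test_separation: str = ""
    feature_legitimacy: str = ""
    distribution_match: str = ""


def missing_justifications(sheet):
    """Return the names of claims that were left without any justification."""
    return [f.name for f in fields(sheet) if not getattr(sheet, f.name).strip()]


def overlapping_ids(train_ids, test_ids):
    """Return record identifiers that appear in both the training and test sets."""
    return set(train_ids) & set(test_ids)


# Illustrative values only; none of this text comes from the paper.
sheet = ModelInfoSheet(
    train_test_separation="Split by country-year before any preprocessing.",
    feature_legitimacy="",  # not yet argued, so it should be flagged
    distribution_match="Test years follow training years, as in deployment.",
)
print(missing_justifications(sheet))          # ['feature_legitimacy']
print(overlapping_ids([1, 2, 3], [3, 4, 5]))  # {3}
```

A filled-in sheet does not prove the absence of leakage; the point of the sketch is that each claim becomes an explicit statement a reviewer can inspect and challenge.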

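Experimental Results

Research Questions

RQ1: How prevalent is data leakage as a cause of irreproducible results in ML-based science across fields?
RQ2: What distinct forms of leakage affect ML-based scientific claims, and how can they be detected and mitigated?
RQ3: Can model info sheets reliably detect or prevent leakage across fields, as demonstrated in the civil war prediction case study?
RQ4: Once leakage is addressed, do complex ML models offer a substantive advantage over logistic regression?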

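Key Findings

Data leakage is a widespread pitfall spanning 17 fields and affecting 329 papers.
The authors identify 8 types of leakage, ranging from textbook errors to distribution shift and non-independence.
Mitigations used in engineering settings and predictive modeling competitions do not apply directly to ML-based science.
Model info sheets can detect leakage and are needed because, in the case study, none of the errors could have been caught by reading the papers alone.
In civil war prediction, the papers claiming that complex ML models outperform logistic regression fail to reproduce because of leakage; after correction, the complex models are not substantively better.
Papers that compare ML models often omit uncertainty quantification and significance testing.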
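To make the last finding concrete, one common way to attach uncertainty to a model comparison is a paired percentile bootstrap over test items. The sketch below is a generic illustration, not the procedure used in the paper; the function name `paired_bootstrap_ci`, its parameters, and the toy correctness flags are all assumptions introduced here.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B) on the same test items."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same test items")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy per-item correctness flags (1 = correct); illustrative only, not data from the paper.
complex_model = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1] * 5
logistic_reg = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1] * 5
low, high = paired_bootstrap_ci(complex_model, logistic_reg)
print(f"95% CI for accuracy difference: [{low:.3f}, {high:.3f}]")
```

If the interval contains zero, the data do not support a claim that one model is more accurate than the other. Resampling individual items assumes they are independent, which is itself one of the assumptions that leakage can violate.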

This review was generated by AI and checked by a human editor.