QUICK REVIEW

[Paper Review] Sources of Irreproducibility in Machine Learning: A Review

Odd Erik Gundersen, Kevin Coakley|arXiv (Cornell University)|Apr 15, 2022

Machine Learning and Data Classification | Citations: 30

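One-Sentence Summary

This paper builds a comprehensive framework that identifies and categorizes 41 design decisions, spanning six factor categories, that can cause irreproducible results in machine learning experiments, and it illustrates the framework with a model comparison example.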

ABSTRACT

Background: Many published machine learning studies are irreproducible. Issues with methodology and not properly accounting for variation introduced by the algorithm themselves or their implementations are attributed as the main contributors to the irreproducibility.
Problem: There exist no theoretical framework that relates experiment design choices to potential effects on the conclusions. Without such a framework, it is much harder for practitioners and researchers to evaluate experiment results and describe the limitations of experiments. The lack of such a framework also makes it harder for independent researchers to systematically attribute the causes of failed reproducibility experiments.
Objective: The objective of this paper is to develop a framework that enable applied data science practitioners and researchers to understand which experiment design choices can lead to false findings and how and by this help in analyzing the conclusions of reproducibility experiments.
Method: We have compiled an extensive list of factors reported in the literature that can lead to machine learning studies being irreproducible. These factors are organized and categorized in a reproducibility framework motivated by the stages of the scientific method. The factors are analyzed for how they can affect the conclusions drawn from experiments. A model comparison study is used as an example.
Conclusion: We provide a framework that describes machine learning methodology from experimental design decisions to the conclusions inferred from them.

Motivation and Objectives

Identify and categorize factors that lead to irreproducible ML studies.
Develop a framework linking experiment design choices to potential false conclusions.
Provide guidance to researchers for designing, documenting, and evaluating reproducibility experiments.
Illustrate the framework with a model comparison study to show how conclusions can be affected by design decisions.

Proposed Method

Compile an extensive literature-based list of factors causing irreproducibility in ML.
Organize factors into a reproducibility framework aligned with stages of the scientific method.
Analyze how each factor can influence study conclusions and outcomes.
Present a model comparison example to demonstrate outcome, analysis, and conclusion reproducibility classifications.
Define and differentiate R1–R4 reproducibility study types based on available documentation (text, code, data, and experiments).
Provide a taxonomy of 41 design decisions grouped into six categories and discuss how to control them in fair experiments.
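
The R1–R4 study types listed above can be sketched as a small lookup. This is only an illustration: the mapping used here (R1 when only the text is available, R2 for text plus code, R3 for text plus data, R4 for text, code, and data) is an assumption about how the types are defined, since this summary does not spell it out. The names SharedArtifacts and study_type are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedArtifacts:
    """Which documentation the original authors made available (hypothetical type)."""
    text: bool = True
    code: bool = False
    data: bool = False


def study_type(shared: SharedArtifacts) -> str:
    """Classify a reproducibility study as R1-R4.

    Assumed mapping (not confirmed by the summary above):
      R1: text only, R2: text + code, R3: text + data, R4: text + code + data.
    """
    if not shared.text:
        # Every type in the assumed mapping builds on a textual description.
        raise ValueError("a textual description of the experiment is required")
    if shared.code and shared.data:
        return "R4"
    if shared.data:
        return "R3"
    if shared.code:
        return "R2"
    return "R1"


if __name__ == "__main__":
    for artifacts in (
        SharedArtifacts(),
        SharedArtifacts(code=True),
        SharedArtifacts(data=True),
        SharedArtifacts(code=True, data=True),
    ):
        print(artifacts, "->", study_type(artifacts))
```

Recording the study type next to the result makes it clearer which factors an independent team could and could not control when an attempt to reproduce a finding fails.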

Experimental Results

Research Questions

RQ1: What experiment design decisions can lead to false findings in ML studies?
RQ2: How do these design decisions affect the outcomes, analyses, and conclusions of ML experiments?
RQ3: How can researchers use a framework to plan, document, and evaluate reproducibility experiments?
RQ4: In what ways can a model comparison study illustrate the framework's classifications of reproducibility?

Key Findings

The authors identify and categorize 41 design decisions that can cause irreproducible ML results.
They present a six-category taxonomy (design, algorithmic, implementation, observation, evaluation, and documentation factors) showing how decisions influence outcomes and conclusions.
A reproducibility framework is proposed to map design decisions to stages of the scientific method and to support reproducibility experiments.
The framework includes the concept of R1–R4 documentation levels for reproducibility studies and emphasizes reporting of variance and error alongside results.
The paper argues that the framework can complement reproducibility checklists, data sheets, challenges, and registered reports by providing a structured basis for reporting and analysis.
A model comparison example demonstrates how different design decisions can yield different conclusions even under the same hypothesis.
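
The points above about reporting variance and about design decisions changing the conclusion can be illustrated with a minimal sketch. It is not the paper's experiment. The function evaluate is a hypothetical stand-in for one train-and-evaluate run, and its scores are synthetic: an assumed true quality plus noise that depends on the random seed. The point is only that a single-seed comparison can favor either model, while the mean and standard deviation over many seeds give a more stable picture.

```python
import random
import statistics


def evaluate(model_quality: float, seed: int, noise: float = 0.02) -> float:
    """Hypothetical stand-in for one training run.

    Returns an assumed true quality plus Gaussian noise driven by the seed.
    The scores are synthetic and are not results from the paper.
    """
    rng = random.Random(seed)
    return model_quality + rng.gauss(0.0, noise)


def summarize(scores):
    """Return (mean, sample standard deviation) over repeated runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def compare_over_seeds(quality_a: float, quality_b: float, seeds):
    """Run both models once per seed and report spread alongside the means."""
    seeds = list(seeds)
    scores_a = [evaluate(quality_a, s) for s in seeds]
    # Offset the seed so the two models do not share identical noise draws.
    scores_b = [evaluate(quality_b, s + 10_000) for s in seeds]
    return {
        "a": summarize(scores_a),
        "b": summarize(scores_b),
        # How many single-seed comparisons each model would have "won".
        "wins_a": sum(x > y for x, y in zip(scores_a, scores_b)),
        "wins_b": sum(y > x for x, y in zip(scores_a, scores_b)),
    }


if __name__ == "__main__":
    result = compare_over_seeds(0.80, 0.81, range(20))
    mean_a, std_a = result["a"]
    mean_b, std_b = result["b"]
    print(f"model A: {mean_a:.3f} +/- {std_a:.3f}")
    print(f"model B: {mean_b:.3f} +/- {std_b:.3f}")
    print(f"single-seed wins: A={result['wins_a']}, B={result['wins_b']}")
```

If the per-seed win counts are split between the two models, a paper that reports one seed could support either conclusion, which is why the spread belongs next to the headline number.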

This review was generated by AI and checked by a human editor.