QUICK REVIEW

[Paper review] Is machine learning good or bad for the natural sciences?

David W. Hogg, Soledad Villar|arXiv (Cornell University)|May 28, 2024

Big Data and Business Intelligence | Cited by 7

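One-sentence summary

The paper argues that machine learning has both valuable uses and potential pitfalls in the natural sciences; it describes in detail two main biases that ML can introduce and proposes safe, causally aware patterns of use.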

ABSTRACT

Machine learning (ML) methods are having a huge impact across all of the sciences. However, ML has a strong ontology - in which only the data exist - and a strong epistemology - in which a model is considered good if it performs well on held-out training data. These philosophies are in strong conflict with both standard practices and key philosophies in the natural sciences. Here we identify some locations for ML in the natural sciences at which the ontology and epistemology are valuable. For example, when an expressive machine learning model is used in a causal inference to represent the effects of confounders, such as foregrounds, backgrounds, or instrument calibration parameters, the model capacity and loose philosophy of ML can make the results more trustworthy. We also show that there are contexts in which the introduction of ML introduces strong, unwanted statistical biases. For one, when ML models are used to emulate physical (or first-principles) simulations, they amplify confirmation biases. For another, when expressive regressions are used to label datasets, those labels cannot be used in downstream joint or ensemble analyses without taking on uncontrolled biases. The question in the title is being asked of all of the natural sciences; that is, we are calling on the scientific communities to take a step back and consider the role and value of ML in their fields; the (partial) answers we give here come from the particular perspective of physics.

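Motivation and objectives

Describe the basic ontology and epistemology of machine learning and contrast them with those of the natural sciences.
Identify two strong biases that machine learning can introduce into research in the natural sciences.
Map out the settings in which machine learning can safely improve scientific practice, and argue for cautious, causally aware use.
Encourage the natural-science communities to assess the role of machine learning and adopt practices that preserve scientific understanding.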

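Proposed approach

Define a broad, data-centric ontology for machine learning and contrast it with the natural sciences' focus on latent structure.
Explain how the epistemology of machine learning centers on performance on held-out data rather than on interpretability of the underlying (latent) structure.
Identify and explain two biases: confirmation bias induced by emulators, and amplification of training-set bias.
Give examples of safe machine-learning applications in settings such as real-time decision making, nuisance modeling, and causal inference.
Discuss the settings in which machine learning can help (such as foregrounds, calibration, and rare-object discovery) and those in which it can do harm.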
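
As a toy illustration of the nuisance-modeling point above (not an experiment from the paper), the sketch below fits the amplitude of a known line-shaped signal sitting on a smooth background. A degree-5 polynomial stands in for an expressive ML model of the background. The function name `fit_line_amplitude`, the line shape, the background, and all numerical settings are assumptions made for this example.

```python
import numpy as np

def fit_line_amplitude(poly_degree, noise_sd=0.05, seed=0):
    """Least-squares amplitude of a known template plus a polynomial background."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-1.0, 1.0, 201)

    # Signal of interest: a narrow line with known shape and unknown amplitude.
    template = np.exp(-0.5 * (x / 0.05) ** 2)
    true_amplitude = 1.0

    # Nuisance: a smooth background that is NOT exactly a polynomial.
    background = 1.0 + 0.5 * x + 0.8 * x**2 + 0.3 * np.sin(3.0 * x)

    data = true_amplitude * template + background + rng.normal(0.0, noise_sd, x.size)

    # Joint linear fit: first column is the signal, the rest model the background.
    design = np.column_stack(
        [template, np.vander(x, poly_degree + 1, increasing=True)]
    )
    coeffs, *_ = np.linalg.lstsq(design, data, rcond=None)
    return coeffs[0]  # estimated signal amplitude

rigid = fit_line_amplitude(poly_degree=0)     # constant background only
flexible = fit_line_amplitude(poly_degree=5)  # more expressive background
print(f"true amplitude 1.0 | rigid fit {rigid:.2f} | flexible fit {flexible:.2f}")
```

In this setup the constant-background fit should return an amplitude well below the true value, because the misfit background leaks into the signal term. The flexible fit should land close to it. The scientific quantity (the amplitude) is still fit with an interpretable model; only the nuisance part is handed to the flexible one.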

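Results

Research questions

RQ1: What roles can machine learning play in advancing understanding and discovery in the natural sciences?
RQ2: What are the main biases that machine learning introduces into analyses in the natural sciences, and can they be mitigated?
RQ3: In what settings can machine learning make safe, beneficial contributions without compromising understanding?
RQ4: How should the natural-science communities adopt machine-learning tools while maintaining epistemological standards?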

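Key findings

Machine learning has a valuable place in contemporary science, especially when used in operational and causal settings.
Machine learning introduces two main biases in the natural sciences: emulator-induced confirmation bias and amplification of training-set bias.
These biases are often hard to correct, and they typically arise when ML-generated labels or emulators are used in downstream analyses.
In causal settings, expressive machine-learning models used to represent confounders can yield more conservative, more robust causal conclusions.
There are many safe, even necessary, uses of machine learning in the natural sciences, including real-time decision making, nuisance modeling, and outlier detection, provided that epistemological standards are maintained.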
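
The label-bias finding above can be made concrete with a small simulation. This is a hedged sketch, not the paper's own analysis: a plain linear regression stands in for an expressive regressor, and the function name `label_bias_demo`, the Gaussian populations, and the noise level are all assumptions chosen for illustration. The regression is trained on one population and then used to label a second population with a different mean.

```python
import numpy as np

def label_bias_demo(n_train=20000, n_test=20000, noise_sd=1.0,
                    test_mean=2.0, seed=0):
    """Compare population statistics of true labels and regression labels."""
    rng = np.random.default_rng(seed)

    # Training set: true labels ~ N(0, 1); the feature is a noisy view of the label.
    y_train = rng.normal(0.0, 1.0, n_train)
    x_train = y_train + rng.normal(0.0, noise_sd, n_train)

    # Least-squares regression of label on feature (highest power first).
    slope, intercept = np.polyfit(x_train, y_train, deg=1)

    # Target population: same noise process, but the true labels are shifted.
    y_test = rng.normal(test_mean, 1.0, n_test)
    x_test = y_test + rng.normal(0.0, noise_sd, n_test)
    y_label = slope * x_test + intercept  # labels produced by the regression

    return {
        "slope": float(slope),
        "true_mean": float(y_test.mean()),
        "label_mean": float(y_label.mean()),
        "true_var": float(y_test.var()),
        "label_var": float(y_label.var()),
    }

stats = label_bias_demo()
for name, value in stats.items():
    print(f"{name:>10s}: {value:.3f}")
```

With equal label and noise variances the fitted slope is roughly 0.5, so each predicted label is pulled toward the training-set mean. Any single label may be a sensible point estimate, but the ensemble of labels has a mean and variance that differ from those of the true population, which is why such labels cannot be passed straight into a downstream population analysis.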


This review was generated by AI and checked by a human editor.