QUICK REVIEW

[Paper Review] Is machine learning good or bad for the natural sciences?

David W. Hogg, Soledad Villar|arXiv (Cornell University)|May 28, 2024

Big Data and Business Intelligence | Citations: 7

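One-line summary

This paper argues that machine learning (ML) has both valuable roles and potential pitfalls in the natural sciences, details two major biases that it can introduce, and proposes safe, causally aware patterns of use.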

ABSTRACT

Machine learning (ML) methods are having a huge impact across all of the sciences. However, ML has a strong ontology - in which only the data exist - and a strong epistemology - in which a model is considered good if it performs well on held-out training data. These philosophies are in strong conflict with both standard practices and key philosophies in the natural sciences. Here we identify some locations for ML in the natural sciences at which the ontology and epistemology are valuable. For example, when an expressive machine learning model is used in a causal inference to represent the effects of confounders, such as foregrounds, backgrounds, or instrument calibration parameters, the model capacity and loose philosophy of ML can make the results more trustworthy. We also show that there are contexts in which the introduction of ML introduces strong, unwanted statistical biases. For one, when ML models are used to emulate physical (or first-principles) simulations, they amplify confirmation biases. For another, when expressive regressions are used to label datasets, those labels cannot be used in downstream joint or ensemble analyses without taking on uncontrolled biases. The question in the title is being asked of all of the natural sciences; that is, we are calling on the scientific communities to take a step back and consider the role and value of ML in their fields; the (partial) answers we give here come from the particular perspective of physics.

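Research motivation and objectives

Explain the fundamental ontology and epistemology of machine learning and contrast them with those of the natural sciences.
Identify two prominent statistical biases that machine learning can introduce into natural-science research.
Show the contexts in which machine learning safely strengthens scientific practice, and argue for careful, causally aware use.
Urge natural-science communities to evaluate the role of ML and to adopt practices that preserve scientific understanding.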

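Proposed approach

Define ML's broad, data-centric ontology, in which only the data exist, and contrast it with the natural sciences' focus on latent structure.
Explain that ML's epistemology centers on performance on held-out data rather than on the interpretability of latent quantities.
Identify and explain two biases: confirmation bias amplified by emulators, and amplification of training-set bias.
Give examples of safe ML applications in real-time decision-making, modeling of nuisance factors, and causal inference.
Discuss situations in which ML can be beneficial (e.g., foreground handling, calibration, discovery of rare objects) and situations in which it can be harmful.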
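To make the held-out criterion concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not code from the paper. The synthetic data, the 80/20 split, and the helper names `fit_line` and `heldout_mse` are all hypothetical. Two models are ranked purely by their prediction error on data withheld from fitting; whether the fitted parameters mean anything physically plays no role in the ranking.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares fit of y ~ a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def heldout_mse(predict, xs, ys):
    """Mean squared prediction error on data the model never saw."""
    return statistics.fmean((predict(x) - y) ** 2 for x, y in zip(xs, ys))

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed toy data: y depends linearly on x, plus Gaussian noise.
xs = [rng.uniform(0.0, 1.0) for _ in range(1000)]
ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]

# Fit on the first 80% of the data and hold out the remaining 20%.
split = 800
train_x, train_y = xs[:split], ys[:split]
test_x, test_y = xs[split:], ys[split:]

mean_y = statistics.fmean(train_y)   # model 1: predict a constant
a, b = fit_line(train_x, train_y)    # model 2: predict with a line

mse_constant = heldout_mse(lambda x: mean_y, test_x, test_y)
mse_line = heldout_mse(lambda x: a + b * x, test_x, test_y)

print(f"held-out MSE, constant model: {mse_constant:.4f}")
print(f"held-out MSE, linear model:   {mse_line:.4f}")
```

Under this criterion the linear model is simply "better" because its held-out error is lower. A natural-science analysis would usually also ask whether the slope corresponds to a real property of the system.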

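Results

Research questions

RQ1: What role can ML play in advancing understanding and discovery in the natural sciences?
RQ2: What are the main biases that ML introduces into natural-science analyses, and can they be mitigated?
RQ3: In which contexts can ML make safe, beneficial contributions without undermining understanding?
RQ4: How should natural-science communities adopt ML tools while maintaining their epistemic standards?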

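Main findings

ML has a valuable place in modern science, particularly in operational and causal contexts.
Introducing ML into the natural sciences brings two main biases: confirmation bias amplified by emulators, and amplification of training-set bias.
These biases are hard to correct and often arise when ML-generated labels or emulators are used in downstream analyses.
In causal settings, using an expressive ML model to represent confounders can lead to more conservative and robust conclusions from causal inference.
There are many uses of ML in the natural sciences that are safe and in some cases essential, including real-time decision-making, modeling of nuisance factors, and outlier detection, provided that epistemic standards are maintained.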
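The label bias can be illustrated with a small Python simulation. This is a hedged sketch, not the paper's own experiment. The Gaussian toy populations, the noise level, and the helper names `simulate` and `fit_line` are assumptions made for illustration. A regression is trained where the true values center on 0, then used to label a population whose true values center on 2. Each label is a reasonable point estimate, but the labels are pulled toward the training-set mean, so their ensemble average misses the true population mean.

```python
import random
import statistics

def simulate(mean, n, rng, noise_sd=1.0):
    """Draw true values y ~ N(mean, 1) and noisy features x = y + noise."""
    ys = [rng.gauss(mean, 1.0) for _ in range(n)]
    xs = [y + rng.gauss(0.0, noise_sd) for y in ys]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares fit of y ~ a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

rng = random.Random(0)  # fixed seed so the run is repeatable

# Training set: true values centered on 0 (assumed toy population).
train_x, train_y = simulate(0.0, 20000, rng)
a, b = fit_line(train_x, train_y)

# Target set: a different population, true values centered on 2.
target_x, target_y = simulate(2.0, 20000, rng)
labels = [a + b * x for x in target_x]  # regression-generated labels

true_mean = statistics.fmean(target_y)
label_mean = statistics.fmean(labels)

print(f"fitted slope:               {b:.3f}")
print(f"true mean of target set:    {true_mean:.3f}")
print(f"mean of regression labels:  {label_mean:.3f}")
```

Because the features are noisy, the fitted slope is less than 1, which shrinks every label toward the training mean. Averaging more labels does not remove this offset, which is why such labels are risky inputs to downstream joint or ensemble analyses.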


This review was generated by AI and checked by a human editor.