QUICK REVIEW

[Paper Review] Is machine learning good or bad for the natural sciences?

David W. Hogg, Soledad Villar | arXiv (Cornell University) | 2024-05-28

Big Data and Business Intelligence | Citations: 7

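One-line summary

This paper argues that machine learning (ML) has both valuable roles and potential pitfalls in the natural sciences, details two major biases that ML can introduce, and proposes safe, causally aware usage patterns.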

ABSTRACT

Machine learning (ML) methods are having a huge impact across all of the sciences. However, ML has a strong ontology - in which only the data exist - and a strong epistemology - in which a model is considered good if it performs well on held-out training data. These philosophies are in strong conflict with both standard practices and key philosophies in the natural sciences. Here we identify some locations for ML in the natural sciences at which the ontology and epistemology are valuable. For example, when an expressive machine learning model is used in a causal inference to represent the effects of confounders, such as foregrounds, backgrounds, or instrument calibration parameters, the model capacity and loose philosophy of ML can make the results more trustworthy. We also show that there are contexts in which the introduction of ML introduces strong, unwanted statistical biases. For one, when ML models are used to emulate physical (or first-principles) simulations, they amplify confirmation biases. For another, when expressive regressions are used to label datasets, those labels cannot be used in downstream joint or ensemble analyses without taking on uncontrolled biases. The question in the title is being asked of all of the natural sciences; that is, we are calling on the scientific communities to take a step back and consider the role and value of ML in their fields; the (partial) answers we give here come from the particular perspective of physics.

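Research motivation and goals

Describe the underlying ontology and epistemology of machine learning and contrast them with those of the natural sciences.
Identify two strong statistical biases that ML can introduce into natural-science research.
Map out the safe contexts in which ML strengthens scientific practice, and argue for careful, causally aware use.
Encourage natural-science communities to evaluate the role of ML and adopt practices that preserve scientific understanding.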

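Proposed method

Define the broad, data-centric ontology of ML, in which only the data exist, and contrast it with the natural sciences' focus on latent structure.
Explain how the epistemology of ML centers on held-out data performance rather than on the interpretability of latent structure.
Identify and explain two biases: confirmation bias induced by emulators and amplification of training-dataset bias.
Present safe ML applications in real-time decision-making, noise modeling, and causal inference.
Discuss the contexts in which ML can be beneficial (e.g., foregrounds, calibration, rare-object discovery) and those in which it can be harmful.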
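
To make the confounder case concrete, here is a minimal sketch, not taken from the paper. It assumes a toy setup: a narrow signal of known shape sits on top of a smooth foreground whose form the analyst does not know. A Legendre polynomial basis stands in for the "expressive model" of the confounder. All names (`fit_signal_amplitude`, `demo`) and all numerical choices are hypothetical and chosen only for illustration.

```python
import numpy as np
from numpy.polynomial import legendre


def fit_signal_amplitude(t, y, template, foreground_degree):
    """Least-squares amplitude of `template`, fit jointly with a Legendre
    polynomial foreground of the given degree. Assumes t lies in [0, 1]."""
    # Map t from [0, 1] to [-1, 1], the natural domain of Legendre polynomials.
    foreground_basis = legendre.legvander(2.0 * t - 1.0, foreground_degree)
    design = np.column_stack([foreground_basis, template])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs[-1]  # the last column of the design matrix is the template


def demo(seed=0):
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, 200)
    # Narrow signal of interest with a known shape (toy assumption).
    template = np.exp(-0.5 * ((t - 0.5) / 0.02) ** 2)
    # Smooth confounder whose functional form is unknown to the analyst.
    foreground = 2.0 * np.sin(3.0 * t)
    true_amplitude = 1.0
    y = foreground + true_amplitude * template + rng.normal(0.0, 0.05, t.size)
    # Rigid confounder model: a constant offset only.
    rigid = fit_signal_amplitude(t, y, template, foreground_degree=0)
    # Flexible confounder model: degree-5 polynomial (stand-in for an ML model).
    flexible = fit_signal_amplitude(t, y, template, foreground_degree=5)
    return true_amplitude, rigid, flexible


if __name__ == "__main__":
    truth, rigid, flexible = demo()
    print(f"true={truth:.2f}  rigid foreground={rigid:.2f}  flexible foreground={flexible:.2f}")
```

In this toy problem the rigid foreground model pushes unmodeled foreground structure into the signal amplitude, while the flexible one absorbs it and leaves the amplitude close to the truth. The cost is a somewhat larger uncertainty on the amplitude, which is the sense in which a flexible nuisance model is the more conservative choice.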

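Experimental results

Research questions

RQ1: What role can ML play in advancing understanding and inquiry in the natural sciences?
RQ2: What are the main biases that ML introduces into natural-science analyses, and can they be mitigated?
RQ3: In which contexts can ML make safe and beneficial contributions without undermining understanding?
RQ4: How should natural-science communities adopt ML tools so as to preserve their epistemic standards?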

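Key findings

ML has a valuable place in modern science, particularly when used in operational and causal contexts.
The two main biases are emulator-induced confirmation bias and amplification of training-dataset bias.
These biases can be hard to correct, and they often arise when ML-generated labels or emulators are used in downstream analyses.
In causal settings, expressive ML models used to model confounders can yield more conservative and more robust causal conclusions.
Many safe, near-essential uses of ML exist in the natural sciences, including real-time decision-making, noise modeling, and outlier detection, provided that epistemic standards are maintained.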
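
One simple mechanism behind the label bias can be sketched as follows. This is an illustrative assumption rather than the paper's own analysis: a plain least-squares regression stands in for an expressive ML regressor, and the function name `simulate_label_shrinkage` and all parameter values are hypothetical. When labels are predicted from noisy features, each prediction is pulled toward the training-set mean, so the predicted labels have a narrower distribution than the true ones.

```python
import numpy as np


def simulate_label_shrinkage(n_train=2000, n_test=2000, noise=1.0, seed=0):
    """Train a regression on noisy features, label a new sample with it, and
    compare the spread of the predicted labels with the spread of the truth."""
    rng = np.random.default_rng(seed)
    # True latent property of each object, unit variance in the population.
    y_train = rng.normal(0.0, 1.0, n_train)
    y_test = rng.normal(0.0, 1.0, n_test)
    # One noisy feature per object that carries information about the label.
    x_train = y_train + rng.normal(0.0, noise, n_train)
    x_test = y_test + rng.normal(0.0, noise, n_test)
    # Least-squares regression of label on feature (stand-in for an ML regressor).
    slope, intercept = np.polyfit(x_train, y_train, 1)
    y_pred = slope * x_test + intercept
    return {
        "slope": float(slope),
        "true_var": float(y_test.var()),
        "label_var": float(y_pred.var()),
    }


if __name__ == "__main__":
    result = simulate_label_shrinkage()
    print(result)
```

For this setup the population regression slope is 1 / (1 + noise²), so with `noise=1.0` the labels are shrunk by about half. Each individual prediction is the best one available in the mean-squared-error sense, yet a downstream estimate of the population variance built from these labels would come out too small. That is the kind of uncontrolled bias the review warns about for joint or ensemble analyses.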


This review was generated by AI and checked by a human editor.