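[Paper Review] Sources of Uncertainty in Supervised Machine Learning -- A Statisticians' View
This paper reframes uncertainty in supervised ML from a statistical perspective, distinguishes aleatoric from epistemic uncertainty, and highlights numerous data- and model-related sources that go beyond a simple two-way decomposition, including overparameterization and data quality.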
Supervised machine learning and predictive models have achieved an impressive standard today, enabling us to answer questions that were inconceivable a few years ago. Besides these successes, it becomes clear, that beyond pure prediction, which is the primary strength of most supervised machine learning algorithms, the quantification of uncertainty is relevant and necessary as well. However, before quantification is possible, types and sources of uncertainty need to be defined precisely. While first concepts and ideas in this direction have emerged in recent years, this paper adopts a conceptual, basic science perspective and examines possible sources of uncertainty. By adopting the viewpoint of a statistician, we discuss the concepts of aleatoric and epistemic uncertainty, which are more commonly associated with machine learning. The paper aims to formalize the two types of uncertainty and demonstrates that sources of uncertainty are miscellaneous and can not always be decomposed into aleatoric and epistemic. Drawing parallels between statistical concepts and uncertainty in machine learning, we emphasise the role of data and their influence on uncertainty.
Research Motivation and Objectives
- Clarify and formalize aleatoric and epistemic uncertainty in supervised ML from a statistical standpoint.
- Illustrate limitations of a simple two-way decomposition and emphasize data-related sources of uncertainty.
- Link ML uncertainty concepts to classical statistics such as bias-variance and total survey error.
- Highlight how data quality, omitted/measurement errors, and deployment changes influence uncertainty.
Proposed Method
- Define aleatoric uncertainty as Var(Y|X=x) and classify remaining uncertainty as epistemic.
- Discuss estimation uncertainty and model uncertainty within the bias-variance framework.
- Use linear regression as an illustration to show that prediction intervals mix aleatoric and estimation uncertainty.
- Extend to overparameterized models and use Kullback-Leibler divergence to compare f(y|x) and p(y|x;θ).
- Describe regularization as prior information when p>n, and relate to AIC-like KL considerations.
- Provide a simulation study showing how the KL divergence components change as the model dimension increases.
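The linear regression illustration can be sketched as follows. This is a minimal sketch under the textbook Gaussian linear model fitted by ordinary least squares, not the paper's own code; the function name `prediction_variance_parts`, the simulated data, and the query point `x0` are all hypothetical. Under that model the predictive variance at `x0` is σ²(1 + x0ᵀ(XᵀX)⁻¹x0): the first term is the aleatoric part Var(Y|X=x0) and the second reflects estimation uncertainty in the coefficients.

```python
import numpy as np

def prediction_variance_parts(X, y, x0):
    """Return (aleatoric, estimation) variance estimates at x0 under OLS.

    Assumes the Gaussian linear model y = X @ beta + noise with constant
    noise variance; this is an illustrative assumption, not the paper's setup.
    """
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)   # estimate of Var(Y | X = x)
    leverage = x0 @ XtX_inv @ x0           # x0^T (X^T X)^{-1} x0
    return sigma2_hat, sigma2_hat * leverage

# Hypothetical simulated data: intercept plus one covariate.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)
x0 = np.array([1.0, 3.0])

aleatoric, estimation = prediction_variance_parts(X, y, x0)

# Interval half-widths (up to a common quantile factor) are square roots,
# so the total half-width is not the sum of the two separate half-widths.
half_width_total = np.sqrt(aleatoric + estimation)
half_width_sum = np.sqrt(aleatoric) + np.sqrt(estimation)
```

Note that the estimated noise variance `sigma2_hat` multiplies both terms, so even the "aleatoric" component reported here is itself subject to estimation error. That is one way to read the statement that the prediction interval mixes the two kinds of uncertainty.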
Experimental Results
Research Questions
- RQ1: What are the formal statistical definitions of aleatoric and epistemic uncertainty in ML contexts?
- RQ2: How do data generation, model class, and training data influence the decomposition and estimation of uncertainty?
- RQ3: What happens to uncertainty sources in overparameterized or high-dimensional settings (p>n)?
- RQ4: How do regularization/prior choices affect the distance between true and fitted models (via KL divergence) in ML?
- RQ5: How do data-related issues like omitted variables and measurement errors contribute to model uncertainty?
Key Findings
- Aleatoric uncertainty is defined as Var(Y|X=x); all remaining uncertainty is classified as epistemic.
- In simple linear models, total prediction uncertainty cannot be additively decomposed into aleatoric and estimation uncertainty in a straightforward way.
- Bias-variance decomposition links aleatoric uncertainty to the irreducible error and connects estimation variance and model bias to epistemic uncertainty.
- Overparameterization permits a second KL-divergence minimum and necessitates regularization, leading to a trade-off between model mis-specification and estimation error.
- Regularization (priors) ensures a full-rank, negative-definite Hessian for the penalized likelihood, enabling unique maximizers even when p>n.
- KL divergence provides a framework to compare true vs. fitted models beyond traditional AIC in high-dimensional settings (p>n).
- Data quality and unobserved variables can induce model uncertainty, showing that simple aleatoric/epistemic split may be insufficient in practice.
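The point about regularization and the Hessian when p>n can be checked numerically. The sketch below assumes a Gaussian linear model with unit noise variance and a ridge penalty (equivalently, a Gaussian prior on the coefficients); the dimensions, the penalty weight `lam`, and the random design are hypothetical choices for illustration, and the paper may use a different penalty. Without the penalty the Hessian of the log-likelihood, -XᵀX, has rank at most n and is only negative semi-definite; adding the penalty shifts every eigenvalue by -lam.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 20, 50, 0.1          # hypothetical sizes with p > n
X = rng.normal(size=(n, p))

# Hessian of the Gaussian log-likelihood in beta (noise variance 1): -X^T X
H_unpen = -(X.T @ X)

# Hessian after adding the ridge penalty -(lam / 2) * ||beta||^2
H_pen = H_unpen - lam * np.eye(p)

rank_unpen = np.linalg.matrix_rank(H_unpen)    # at most n, so rank-deficient
max_eig_pen = np.linalg.eigvalsh(H_pen).max()  # largest eigenvalue, about -lam
```

Because all eigenvalues of `H_pen` are strictly negative, the penalized log-likelihood is strictly concave in this setting and has a unique maximizer, whereas the unpenalized problem has infinitely many.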
This review was generated by AI and reviewed by a human editor.