QUICK REVIEW

[Paper Review] Sources of Uncertainty in Supervised Machine Learning -- A Statisticians' View

Cornelia Gruber, Patrick Oliver Schenk | arXiv (Cornell University) | May 26, 2023

Explainable Artificial Intelligence (XAI) | 17 citations

TL;DR

The paper reframes uncertainty in supervised ML from a statistical perspective, distinguishes aleatoric from epistemic uncertainty, and highlights numerous data- and model-related sources that go beyond a simple two-way decomposition, including overparameterization and data quality.

ABSTRACT

Supervised machine learning and predictive models have achieved an impressive standard today, enabling us to answer questions that were inconceivable a few years ago. Besides these successes, it becomes clear, that beyond pure prediction, which is the primary strength of most supervised machine learning algorithms, the quantification of uncertainty is relevant and necessary as well. However, before quantification is possible, types and sources of uncertainty need to be defined precisely. While first concepts and ideas in this direction have emerged in recent years, this paper adopts a conceptual, basic science perspective and examines possible sources of uncertainty. By adopting the viewpoint of a statistician, we discuss the concepts of aleatoric and epistemic uncertainty, which are more commonly associated with machine learning. The paper aims to formalize the two types of uncertainty and demonstrates that sources of uncertainty are miscellaneous and can not always be decomposed into aleatoric and epistemic. Drawing parallels between statistical concepts and uncertainty in machine learning, we emphasise the role of data and their influence on uncertainty.

Motivation & Objective

Clarify and formalize aleatoric and epistemic uncertainty in supervised ML from a statistical standpoint.
Illustrate limitations of a simple two-way decomposition and emphasize data-related sources of uncertainty.
Link ML uncertainty concepts to classical statistical concepts such as the bias-variance decomposition and total survey error.
Highlight how data quality, omitted/measurement errors, and deployment changes influence uncertainty.

Proposed method

Define aleatoric uncertainty as Var(Y|X=x) and classify remaining uncertainty as epistemic.
Discuss estimation uncertainty and model uncertainty within the bias-variance framework.
Use linear regression as an illustration to show that prediction intervals mix aleatoric and estimation uncertainty.
Extend to overparameterized models and use Kullback-Leibler divergence to compare f(y|x) and p(y|x;θ).
Describe regularization as prior information when p>n, and relate it to AIC-like KL considerations.
Provide a simulation study to show the KL divergence components as the model dimension increases.
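
The linear-regression illustration in the list above can be sketched numerically. This is a minimal sketch and not the paper's code: the simulated data, the noise level, the new point `x0`, and the helper name `prediction_variance_parts` are all assumptions made for illustration. It relies on the standard OLS result that the prediction-error variance at `x0` is sigma^2 * (1 + x0' (X'X)^{-1} x0). The noise term and the estimation term add on the variance scale, but the interval half-width is a square root and so does not split into two separate widths, and the estimated sigma^2 enters both terms.

```python
import numpy as np


def prediction_variance_parts(X, y, x0):
    """Split the OLS prediction-error variance at x0 into two terms.

    Returns (noise_term, estimation_term, total), where
    total = sigma2_hat * (1 + x0' (X'X)^{-1} x0).
    """
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)          # estimate of Var(Y | X = x)
    leverage = x0 @ np.linalg.solve(X.T @ X, x0)  # x0' (X'X)^{-1} x0
    noise_term = sigma2_hat                       # aleatoric part (estimated)
    estimation_term = sigma2_hat * leverage       # uncertainty about beta
    return noise_term, estimation_term, noise_term + estimation_term


# Hypothetical data-generating process: y = 1 + 2x + noise, noise sd 0.5.
rng = np.random.default_rng(0)
n = 50
x = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])              # intercept + one feature
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=n)

x0 = np.array([1.0, 0.9])                         # new point (with intercept)
noise_term, estimation_term, total = prediction_variance_parts(X, y, x0)

print("noise term (variance scale):     ", noise_term)
print("estimation term (variance scale):", estimation_term)
print("half-width scale sqrt(total):    ", np.sqrt(total))
print("sum of separate square roots:    ",
      np.sqrt(noise_term) + np.sqrt(estimation_term))
```

Comparing the last two printed values shows why reading the interval width as "aleatoric width plus estimation width" would overstate it, even in this simplest setting.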

Experimental results

Research questions

RQ1: What are the formal statistical definitions of aleatoric and epistemic uncertainty in ML contexts?
RQ2: How do data generation, model class, and training data influence the decomposition and estimation of uncertainty?
RQ3: What happens to uncertainty sources in overparameterized or high-dimensional settings (p>n)?
RQ4: How do regularization/prior choices affect the distance between true and fitted models (via KL divergence) in ML?
RQ5: How do data-related issues like omitted variables and measurement errors contribute to model uncertainty?

Key findings

Aleatoric uncertainty is defined as Var(Y|X=x); all remaining uncertainty is epistemic.
In simple linear models, total prediction uncertainty cannot be additively decomposed into aleatoric and estimation uncertainty in a straightforward way.
Bias-variance decomposition links aleatoric uncertainty to the irreducible error and connects estimation variance and model bias to epistemic uncertainty.
Overparameterization permits a second KL-divergence minimum and necessitates regularization, leading to a trade-off between model mis-specification and estimation error.
Regularization (priors) ensures a full-rank, negative-definite Hessian for the penalized likelihood, enabling unique maximizers even when p>n.
KL divergence provides a framework to compare true vs. fitted models beyond traditional AIC in high-dimensional settings (p>n).
Data quality and unobserved variables can induce model uncertainty, showing that a simple aleatoric/epistemic split may be insufficient in practice.
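
The finding about regularization and the Hessian can be checked on a small example. The sketch below is an illustration under stated assumptions, not the paper's setup: it assumes a Gaussian linear model with known noise variance `sigma2` and uses a ridge penalty (equivalent to a Gaussian prior on the coefficients) with a hypothetical strength `lam`. Under these assumptions the Hessian of the log-likelihood is -X'X / sigma2, which has rank at most n and is singular when p>n, while the penalized Hessian -(X'X / sigma2 + lam * I) is negative definite.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 30                        # more parameters than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
sigma2 = 1.0                         # assumed known noise variance
lam = 0.1                            # hypothetical ridge penalty strength

# Hessian of the Gaussian log-likelihood with respect to beta
H_unpen = -(X.T @ X) / sigma2
# Hessian of the ridge-penalized log-likelihood
H_pen = H_unpen - lam * np.eye(p)

rank_unpen = np.linalg.matrix_rank(H_unpen)
rank_pen = np.linalg.matrix_rank(H_pen)
max_eig_pen = np.linalg.eigvalsh(H_pen).max()

# With a negative-definite Hessian the penalized maximizer is unique
# and solves the linear system (X'X / sigma2 + lam * I) beta = X'y / sigma2.
beta_hat = np.linalg.solve(-H_pen, X.T @ y / sigma2)

print("rank of unpenalized Hessian:", rank_unpen, "of", p)
print("rank of penalized Hessian:  ", rank_pen, "of", p)
print("largest penalized eigenvalue:", max_eig_pen)
```

The choice of `lam` here is arbitrary. Any strictly positive value makes the penalized Hessian negative definite, but larger values pull the solution further toward zero, which is one concrete form of the trade-off between mis-specification and estimation error noted above.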

This review was created by AI and reviewed by human editors.