QUICK REVIEW

[Paper Review] Sources of Uncertainty in Supervised Machine Learning -- A Statisticians' View

Cornelia Gruber, Patrick Oliver Schenk|arXiv (Cornell University)|May 26, 2023

Explainable Artificial Intelligence (XAI) | Cited by 17

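One-Sentence Summary

This paper reframes uncertainty in supervised learning from a statistical perspective. It distinguishes aleatoric from epistemic uncertainty and emphasizes that many data- and model-related sources, including overparameterization and data quality, fall outside the simple two-way decomposition.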

ABSTRACT

Supervised machine learning and predictive models have achieved an impressive standard today, enabling us to answer questions that were inconceivable a few years ago. Besides these successes, it becomes clear, that beyond pure prediction, which is the primary strength of most supervised machine learning algorithms, the quantification of uncertainty is relevant and necessary as well. However, before quantification is possible, types and sources of uncertainty need to be defined precisely. While first concepts and ideas in this direction have emerged in recent years, this paper adopts a conceptual, basic science perspective and examines possible sources of uncertainty. By adopting the viewpoint of a statistician, we discuss the concepts of aleatoric and epistemic uncertainty, which are more commonly associated with machine learning. The paper aims to formalize the two types of uncertainty and demonstrates that sources of uncertainty are miscellaneous and can not always be decomposed into aleatoric and epistemic. Drawing parallels between statistical concepts and uncertainty in machine learning, we emphasise the role of data and their influence on uncertainty.

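Research Motivation and Goals

Clarify and formalize aleatoric and epistemic uncertainty in supervised learning from a statistical perspective.
Show the limitations of the simple two-way decomposition and highlight data-related sources of uncertainty.
Connect ML uncertainty concepts to classical statistical ideas such as the bias-variance decomposition and total survey error.
Highlight how data quality, omitted variables, measurement error, and changes at deployment affect uncertainty.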

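Proposed Approach

Define aleatoric uncertainty as Var(Y|X=x) and classify all remaining uncertainty as epistemic.
Discuss estimation uncertainty and model uncertainty within the bias-variance framework.
Use linear regression as an example to show a prediction interval that mixes aleatoric and estimation uncertainty.
Extend to overparameterized models, using KL divergence to compare f(y|x) with p(y|x;θ).
Describe regularization as prior information when p>n and relate it to AIC-like KL considerations.
Provide simulation experiments showing the KL-divergence components as model dimension increases.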
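
The linear-regression point can be illustrated with a minimal sketch. It is not taken from the paper: it assumes the textbook homoskedastic Gaussian linear model fitted by ordinary least squares, and the function name `prediction_variance_terms`, the simulated data, and the 1.96 normal quantile are all illustrative choices. The estimated prediction variance at a new point x0 is the sum of a noise term σ̂² and an estimation term σ̂² · x0ᵀ(XᵀX)⁻¹x0. Both terms are scaled by the same estimate σ̂², which is one way to see why the interval mixes the two kinds of uncertainty.

```python
import numpy as np

def prediction_variance_terms(X, y, x0):
    """Return (noise_term, estimation_term) of the estimated OLS
    prediction variance at x0, assuming homoskedastic Gaussian noise."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y           # OLS coefficient estimate
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)   # unbiased noise-variance estimate
    noise_term = sigma2_hat                           # proxy for Var(Y | X = x0)
    estimation_term = sigma2_hat * (x0 @ XtX_inv @ x0)  # variance of the fitted mean
    return noise_term, estimation_term

# Simulated data (assumption): intercept 1, slope 2, noise standard deviation 0.5.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

x0 = np.array([1.0, 0.0])                  # new point: intercept plus x = 0
noise, est = prediction_variance_terms(X, y, x0)

# Approximate 95% prediction-interval half-width (normal quantile used
# instead of the exact t quantile, for simplicity).
half_width = 1.96 * np.sqrt(noise + est)
```

With n much larger than p, the estimation term is small relative to the noise term, so the interval width is dominated by the estimated noise variance.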

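Experimental Results

Research Questions

RQ1: What are the formal statistical definitions of aleatoric and epistemic uncertainty in the ML context?
RQ2: How do the data-generating process, the model class, and the training data affect the decomposition and estimation of uncertainty?
RQ3: What happens to the sources of uncertainty in overparameterized or high-dimensional settings (p>n)?
RQ4: How does the choice of regularization or prior affect, via KL divergence, the distance between the true model and the fitted model?
RQ5: How do data-related issues such as omitted variables and measurement error increase model uncertainty?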

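Key Findings

Aleatoric uncertainty is defined as Var(Y|X=x); all remaining uncertainty is epistemic.
In the simple linear model, total predictive uncertainty cannot be decomposed in a straightforward way into aleatoric and estimation uncertainty.
The bias-variance decomposition links aleatoric uncertainty to the irreducible error, and links estimation variance and model bias to epistemic uncertainty.
Overparameterization allows a second KL-divergence minimum and requires regularization, leading to a trade-off between model misspecification and estimation error.
Regularization (a prior) ensures that the Hessian of the penalized likelihood is full-rank and negative definite, so a unique maximum exists even when p>n.
KL divergence provides a framework for comparing the true and fitted models in high-dimensional settings (p>n), going beyond the traditional AIC.
Data quality and unobserved variables can induce model uncertainty, indicating that a simple aleatoric/epistemic separation may be insufficient in practice.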
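
The Hessian finding can be checked numerically in a minimal sketch under stated assumptions. It assumes a Gaussian linear model with unit noise variance and a ridge penalty (equivalent to a Gaussian prior on the coefficients). These are illustrative choices, not necessarily the setup used in the paper. Under those assumptions the log-likelihood Hessian in the coefficients is -XᵀX, which has rank at most n and is therefore singular when p>n, while the penalized Hessian -(XᵀX + λI) is full-rank and negative definite.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 20, 50, 0.1          # p > n; lam is a hypothetical ridge penalty weight
X = rng.normal(size=(n, p))

# Hessian of the Gaussian log-likelihood in beta (unit noise variance assumed).
hess_unpenalized = -(X.T @ X)

# Hessian of the ridge-penalized log-likelihood: the penalty adds -lam * I.
hess_penalized = -(X.T @ X + lam * np.eye(p))

rank_unpenalized = np.linalg.matrix_rank(hess_unpenalized)
rank_penalized = np.linalg.matrix_rank(hess_penalized)

# Largest eigenvalue of the symmetric penalized Hessian; a negative value
# means every eigenvalue is negative, i.e. the matrix is negative definite.
max_eig_penalized = np.linalg.eigvalsh(hess_penalized).max()
```

A rank-deficient Hessian means the unpenalized likelihood is flat along some directions, so its maximizer is not unique. The penalty removes those flat directions, which is the mechanism the finding describes.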


This review was generated by AI and checked by a human editor.