QUICK REVIEW

[Paper Review] Uncertainty in Gradient Boosting via Ensembles

Andrey Malinin, Liudmila Prokhorenkova|arXiv (Cornell University)|Jun 18, 2020

Gaussian Processes and Bayesian Inference|References 35|Cited by 33

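One-Sentence Summary

This paper presents ensemble-based uncertainty estimation for GBDT models, covering Stochastic Gradient Langevin Boosting (SGLB) ensembles and a virtual ensemble (vSGLB), to quantify data uncertainty and knowledge uncertainty, with application to out-of-domain detection.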

ABSTRACT

For many practical, high-risk applications, it is essential to quantify uncertainty in a model's predictions to avoid costly mistakes. While predictive uncertainty is widely studied for neural networks, the topic seems to be under-explored for models based on gradient boosting. However, gradient boosting often achieves state-of-the-art results on tabular data. This work examines a probabilistic ensemble-based framework for deriving uncertainty estimates in the predictions of gradient boosting classification and regression models. We conducted experiments on a range of synthetic and real datasets and investigated the applicability of ensemble approaches to gradient boosting models that are themselves ensembles of decision trees. Our analysis shows that ensembles of gradient boosting models successfully detect anomalous inputs while having limited ability to improve the predicted total uncertainty. Importantly, we also propose a concept of a virtual ensemble to get the benefits of an ensemble via only one gradient boosting model, which significantly reduces complexity.

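Research Motivation and Objectives

Motivate and formalize the need for predictive uncertainty in GBDT models for tabular data.
Develop an ensemble-based framework that separates data uncertainty from knowledge uncertainty in GBDT predictions.
Propose ways to generate ensembles of GBDT models (SGB and SGLB) and introduce the virtual ensemble (vSGLB) to reduce computational cost.
Analyze the properties of ensemble-based uncertainty estimates on synthetic data and evaluate them on classification and regression benchmarks.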

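Proposed Method

Frame uncertainty from a Bayesian ensemble perspective: model parameters are random variables, and predictions are aggregated over samples from the posterior.
Describe entropy-based total uncertainty, mutual-information-based knowledge uncertainty, and a variance decomposition for regression.
Describe three ensembling strategies: SGB (random subsampling of the data), SGLB (sampling from the posterior via Langevin dynamics), and virtual SGLB (truncated sub-models of a single GBDT).
Explain the SGLB update: injected Gaussian noise combined with a shrinkage-style update rule, which yields a stationary posterior distribution.
Build the virtual ensemble by taking every K-th parameter set along the SGLB trajectory, which reduces cost.
Use NGBoost-style predictive distributions (mean and variance) for regression and categorical distributions for classification, trained with the negative log-likelihood objective.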
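
The decompositions listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the array layouts, and the `start` cutoff in `virtual_ensemble_sizes` are assumptions made for the example. It takes the per-member predictions as given, however the ensemble was produced (SGB, SGLB, or truncated sub-models).

```python
import numpy as np


def classification_uncertainty(probs, eps=1e-12):
    """Entropy-based decomposition for an ensemble of classifiers.

    probs: array of shape (M, N, C) with M ensemble members, N inputs,
    C classes (this layout is an assumption of the sketch).
    Returns (total, data, knowledge), each of shape (N,).
    """
    probs = np.asarray(probs, dtype=float)
    mean_probs = probs.mean(axis=0)
    # Total uncertainty: entropy of the averaged predictive distribution.
    total = -np.sum(mean_probs * np.log(mean_probs + eps), axis=-1)
    # Data uncertainty: average entropy of the individual members.
    data = -np.sum(probs * np.log(probs + eps), axis=-1).mean(axis=0)
    # Knowledge uncertainty: mutual information, total minus data.
    knowledge = total - data
    return total, data, knowledge


def regression_uncertainty(means, variances):
    """Variance decomposition for an ensemble of (mean, variance) regressors.

    means, variances: arrays of shape (M, N).
    Returns (total, data, knowledge), each of shape (N,).
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    data = variances.mean(axis=0)   # average predicted noise variance
    knowledge = means.var(axis=0)   # disagreement between member means
    return data + knowledge, data, knowledge


def virtual_ensemble_sizes(n_trees, k, start=0):
    """Tree counts of truncated sub-models: every k-th prefix of one model.

    `start` (hypothetical parameter) drops prefixes shorter than `start` trees.
    """
    return [t for t in range(k, n_trees + 1, k) if t >= start]


# Two members that agree vs. two members that disagree on one input.
agree = np.array([[[0.9, 0.1]], [[0.9, 0.1]]])
disagree = np.array([[[0.9, 0.1]], [[0.1, 0.9]]])
print(classification_uncertainty(agree)[2])
print(classification_uncertainty(disagree)[2])
print(virtual_ensemble_sizes(10, 2))
```

When all members agree, the knowledge term is zero and the total uncertainty is pure data uncertainty. When members disagree, the entropy of the averaged prediction exceeds the average member entropy, and the gap is the knowledge uncertainty.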

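Experimental Results

Research Questions

RQ1: Can ensemble methods (SGB, SGLB) provide meaningful estimates of data uncertainty and knowledge uncertainty for GBDT models?
RQ2: Does the virtual ensemble (vSGLB) retain the uncertainty benefits while reducing computational cost?
RQ3: How well do ensemble-based uncertainty estimates detect out-of-domain inputs and errors in classification and regression tasks?
RQ4: What are the comparative strengths of SGB, SGLB, and vSGLB for practical GBDT uncertainty estimation?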

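Key Findings

Ensembles of GBDT models can detect anomalous (out-of-domain) inputs through increased total and knowledge uncertainty, with knowledge uncertainty highlighting OOD regions.
SGLB ensembles asymptotically sample from the true posterior, which gives the uncertainty estimates a principled basis.
A virtual ensemble derived from a single GBDT model (vSGLB) can produce a useful knowledge-uncertainty signal, particularly on classification tasks with categorical features, at reduced computational cost.
For both regression and classification, total uncertainty is usually more effective than knowledge uncertainty for error detection, while knowledge uncertainty gives the stronger OOD signal.
Because its truncated sub-models are correlated, vSGLB tends to underperform a true SGLB ensemble, but it remains useful in some settings (especially with categorical features).
Overall, ensembles give GBDT principled uncertainty estimates and better OOD detection when knowledge uncertainty is used, while vSGLB is a cheaper but sometimes weaker alternative.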
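
To make the OOD-detection finding concrete, here is one way such a comparison could be scored. The sketch assumes that ROC-AUC is the metric and that a per-input uncertainty score (for example, knowledge uncertainty) is already available for in-domain and out-of-domain inputs. The function name and the sample scores are hypothetical, and this summary does not state which metric the paper reports.

```python
import numpy as np


def ood_roc_auc(scores_in, scores_out):
    """ROC-AUC of an uncertainty score used as an OOD detector.

    Computed as the probability that a random out-of-domain input receives
    a higher score than a random in-domain input, counting ties as one half.
    Uses an (n_out, n_in) comparison matrix, so it suits small samples only.
    """
    scores_in = np.asarray(scores_in, dtype=float)
    scores_out = np.asarray(scores_out, dtype=float)
    diff = scores_out[:, None] - scores_in[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


# Hypothetical scores: OOD inputs all receive higher uncertainty.
in_domain = [0.01, 0.02, 0.05]
out_of_domain = [0.30, 0.40]
print(ood_roc_auc(in_domain, out_of_domain))  # 1.0
```

A value near 1.0 means the score ranks out-of-domain inputs above in-domain ones, and 0.5 means it carries no ranking signal. Running the same function on total and on knowledge uncertainty gives a like-for-like comparison of the two as OOD detectors.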


This review was generated by AI and reviewed by a human editor.