QUICK REVIEW

[Paper Review] Verified Uncertainty Calibration

Ananya Kumar, Percy Liang|arXiv (Cornell University)|Sep 23, 2019

Statistical Methods and Inference | References: 47 | Cited by: 85

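One-Sentence Summary

Introduces the scaling-binning calibrator, which produces calibrated probabilities with favorable sample complexity; shows that the calibration error of scaling methods is underestimated; and provides a debiased, more sample-efficient estimator of calibration error. Validated on CIFAR-10 and ImageNet.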

ABSTRACT

Applications such as weather forecasting and personalized medicine demand models that output calibrated probability estimates---those representative of the true likelihood of a prediction. Most models are not calibrated out of the box but are recalibrated by post-processing model outputs. We find in this work that popular recalibration methods like Platt scaling and temperature scaling are (i) less calibrated than reported, and (ii) current techniques cannot estimate how miscalibrated they are. An alternative method, histogram binning, has measurable calibration error but is sample inefficient---it requires $O(B/ε^2)$ samples, compared to $O(1/ε^2)$ for scaling methods, where $B$ is the number of distinct probabilities the model can output. To get the best of both worlds, we introduce the scaling-binning calibrator, which first fits a parametric function to reduce variance and then bins the function values to actually ensure calibration. This requires only $O(1/ε^2 + B)$ samples. Next, we show that we can estimate a model's calibration error more accurately using an estimator from the meteorological community---or equivalently measure its calibration error with fewer samples ($O(\sqrt{B})$ instead of $O(B)$). We validate our approach with multiclass calibration experiments on CIFAR-10 and ImageNet, where we obtain a 35% lower calibration error than histogram binning and, unlike scaling methods, guarantees on true calibration. In these experiments, we also estimate the calibration error and ECE more accurately than the commonly used plugin estimators. We implement all these methods in a Python library: https://pypi.org/project/uncertainty-calibration
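
As a rough worked example of the sample-complexity gap quoted in the abstract, take $B = 100$ and $ε = 0.01$. Both values are illustrative assumptions, and the constants and logarithmic factors hidden by the $O$-notation are ignored, so the numbers below indicate orders of magnitude only.

```latex
\begin{align*}
\text{histogram binning:} \quad & \frac{B}{\epsilon^2} = \frac{100}{(0.01)^2} = 10^{6} \\
\text{scaling methods:} \quad & \frac{1}{\epsilon^2} = \frac{1}{(0.01)^2} = 10^{4} \\
\text{scaling-binning:} \quad & \frac{1}{\epsilon^2} + B = 10^{4} + 100 = 10{,}100 \\
\text{estimating calibration error:} \quad & \sqrt{B} = 10 \quad \text{instead of} \quad B = 100
\end{align*}
```

Under these assumptions, the scaling-binning calibrator needs roughly the same number of samples as a scaling method, and about 100 times fewer than histogram binning.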

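Motivation and Objectives

Motivate the need for calibrated probabilities in critical applications (medicine, weather forecasting, NLP).
Show the limitations of common recalibration methods (Platt scaling, temperature scaling), both in achieving true calibration and in estimating their calibration error.
Propose a method that combines scaling with binning to obtain a measurable calibration error together with favorable sample complexity.
Develop efficient estimators of calibration error, including a debiased estimator with lower sample complexity.
Empirically validate calibration performance and estimation accuracy on multiclass datasets (CIFAR-10, ImageNet).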

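Proposed Method

Proposes the scaling-binning calibrator, which first fits a function g from a family G on the recalibration data.
Builds a uniform-mass binning scheme on the outputs g(z) and uses it to bin the transformed scores.
Discretizes by outputting the average g(z) value in each bin, which gives g_B; the calibrated model is g_B∘f.
Gives a theoretical calibration bound: $\mathrm{CE}(g_B) \le \sqrt{2} \cdot \min_{g \in G} \mathrm{CE}(g) + ε$, provided $n \ge c(B \log B + \log B / ε^2)$ samples, where $c$ is a constant.
Proves that, under certain conditions, binning the outputs of g yields a lower calibration error than using g alone.
Provides the algorithm and proof sketches for the calibration guarantee and for the sample-complexity improvement over histogram binning.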

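Experimental Results

Research Questions

RQ1: Can scaling methods (Platt scaling, temperature scaling) reliably calibrate probabilities when the true calibration error is hard to measure?
RQ2: Can we design a recalibration method that is both sample efficient and comes with verifiable calibration guarantees?
RQ3: Does combining scaling with histogram-style binning (scaling-binning) outperform existing methods in calibration error and measurability?
RQ4: How can calibration error be estimated more efficiently in the multiclass setting?
RQ5: How does the binning strategy affect calibration error measurement and mean squared error (MSE)?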

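Key Findings

The scaling-binning calibrator achieves lower calibration error than histogram binning on CIFAR-10 and ImageNet (B=100).
The method needs O(1/ε^2 + B) samples to reach calibration error ε, an improvement over the O(B/ε^2) samples needed by histogram binning.
Binning makes the squared calibration error efficiently measurable: the estimate Ê^2 comes with an ε-approximation guarantee.
The debiased estimator reduces the sample complexity of estimating calibration error from O(B) to O(√B).
Compared with histogram binning, experiments show 35% lower calibration error on CIFAR-10 and 5 times lower on ImageNet (B=100). Unlike scaling methods, the scaling-binning calibrator also comes with guarantees on true calibration.
The authors provide an open-source Python library for uncertainty calibration: https://pypi.org/project/uncertainty-calibration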
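
To make the difference between the plugin and debiased estimators concrete, here is a minimal sketch for a binary model with finitely many distinct outputs, where each distinct output is treated as one bin. The function name `squared_ce_estimates` is hypothetical, and the variance correction `y_hat * (1 - y_hat) / (n_b - 1)` is our reading of the debiased estimator, not a quotation of the library's code.

```python
import numpy as np


def squared_ce_estimates(preds, labels):
    """Plugin and debiased estimates of the squared calibration error.

    preds:  model outputs taking finitely many distinct values (one per bin)
    labels: binary labels in {0, 1}
    """
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    n = len(preds)
    plugin, debiased = 0.0, 0.0
    for s in np.unique(preds):
        in_bin = preds == s
        n_b = int(in_bin.sum())
        y_hat = labels[in_bin].mean()  # empirical label frequency in the bin
        gap_sq = (s - y_hat) ** 2
        plugin += (n_b / n) * gap_sq
        # Subtract an estimate of the sampling variance of y_hat, which
        # inflates gap_sq even for a perfectly calibrated model.
        # Bins with a single point get no correction (assumption).
        correction = y_hat * (1.0 - y_hat) / (n_b - 1) if n_b > 1 else 0.0
        debiased += (n_b / n) * (gap_sq - correction)
    return plugin, debiased


# Usage: a model that always outputs 0.5 on balanced labels.
plugin, debiased = squared_ce_estimates([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])
```

The plugin estimate is never negative, so on a finite sample it tends to overestimate the squared error of a well-calibrated model. The debiased estimate removes that upward bias and can therefore come out slightly negative.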


This review was generated by AI and reviewed by human editors.