QUICK REVIEW

[Paper Review] Measuring Calibration in Deep Learning

Jeremy Nixon, Mike Dusenberry|arXiv (Cornell University)|Apr 2, 2019

Topic: Adversarial Robustness in Machine Learning | References: 29 | Cited by: 156

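One-Sentence Summary

This paper critically analyzes calibration metrics for multiclass classifiers, shows that common metrics such as ECE can be misleading, and proposes alternative metrics (ACE, SCE, GCE) along with best-practice recommendations.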

ABSTRACT

Overconfidence and underconfidence in machine learning classifiers are measured by calibration: the degree to which the probabilities predicted for each class match the accuracy of the classifier on that prediction. How one measures calibration remains a challenge: expected calibration error, the most popular metric, has numerous flaws which we outline, and there is no clear empirical understanding of how its choices affect conclusions in practice, and what recommendations there are to counteract its flaws. In this paper, we perform a comprehensive empirical study of choices in calibration measures including measuring all probabilities rather than just the maximum prediction, thresholding probability values, class conditionality, number of bins, bins that are adaptive to the datapoint density, and the norm used to compare accuracies to confidences. To analyze the sensitivity of calibration measures, we study the impact of optimizing directly for each variant with recalibration techniques. Across MNIST, Fashion MNIST, CIFAR-10/100, and ImageNet, we find that conclusions on the rank ordering of recalibration methods are drastically impacted by the choice of calibration measure. We find that conditioning on the class leads to more effective calibration evaluations, and that using the L2 norm rather than the L1 norm improves both optimization for calibration metrics and the rank correlation measuring metric consistency. Adaptive binning schemes lead to more stability of metric rank ordering when the number of bins varies, and are also recommended. We open source a library for the use of our calibration measures.
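
For reference, the expected calibration error named in the abstract is usually written in the binned form below. The symbols B, n_b, N, acc(b) and conf(b) are notation chosen here for illustration and are not taken from this page.

```latex
\mathrm{ECE} = \sum_{b=1}^{B} \frac{n_b}{N}\,\bigl|\mathrm{acc}(b) - \mathrm{conf}(b)\bigr|
```

Predictions are grouped into B equal-width bins by their top-1 confidence; n_b is the number of predictions in bin b, N is the total number of predictions, acc(b) is the fraction of correct predictions in the bin, and conf(b) is the mean confidence in the bin. Only the maximum probability of each prediction enters the sum, which is one of the design choices the paper examines.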

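Research Motivation and Objectives

Assess the limitations and pathologies of Expected Calibration Error (ECE) in multiclass settings.
Propose and analyze alternative calibration metrics that address class conditionality, adaptivity, and the choice of norm.
Study how binning, thresholding, and recalibration affect calibration evaluation across datasets.
Provide practical recommendations and an open-source tool for robust calibration evaluation.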

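Proposed Method

Formally analyze calibration error definitions along five properties: class conditionality, adaptivity, focus on the maximum probability, norm, and thresholding.
Define General Calibration Error (GCE) as a configurable space of metrics and evaluate it.
Introduce Adaptive Calibration Error (ACE), which uses equal-frequency binning so that each calibration range holds the same number of predictions.
Define Static Calibration Error (SCE) as a multiclass extension of ECE that bins the predicted probability of every class.
Discuss thresholding as a way to handle the large number of near-zero probabilities, and its effect on calibration estimates.
Run empirical evaluations on MNIST, Fashion-MNIST, CIFAR-10/100, and ImageNet to study metric behavior and the effect of recalibration.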
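
The differences between ECE, SCE, and ACE are easier to see side by side. The following is a minimal sketch in plain Python, not the authors' open-source library: the function names, the list-of-lists input format, and the handling of empty bins are assumptions made here for illustration. It uses the L1 norm and no thresholding.

```python
def _bin_gap(confs, hits):
    """Absolute gap between mean accuracy and mean confidence in one bin."""
    n = len(confs)
    return abs(sum(hits) / n - sum(confs) / n)


def ece(probs, labels, n_bins=10):
    """ECE: equal-width bins over the top-1 confidence only."""
    bins = [([], []) for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        b = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[b][0].append(conf)
        bins[b][1].append(1.0 if pred == y else 0.0)
    n = len(labels)
    # Each non-empty bin is weighted by its share of the predictions.
    return sum(len(c) / n * _bin_gap(c, h) for c, h in bins if c)


def sce(probs, labels, n_bins=10):
    """SCE: the same equal-width binning, applied to every class probability."""
    n, k = len(labels), len(probs[0])
    total = 0.0
    for cls in range(k):
        bins = [([], []) for _ in range(n_bins)]
        for p, y in zip(probs, labels):
            b = min(int(p[cls] * n_bins), n_bins - 1)
            bins[b][0].append(p[cls])
            bins[b][1].append(1.0 if y == cls else 0.0)
        total += sum(len(c) / n * _bin_gap(c, h) for c, h in bins if c)
    return total / k  # average over classes


def ace(probs, labels, n_ranges=10):
    """ACE: per-class equal-frequency ranges, averaged without bin weights."""
    k = len(probs[0])
    total, count = 0.0, 0
    for cls in range(k):
        # Sort this class's probabilities so each range holds ~n / n_ranges items.
        pairs = sorted((p[cls], 1.0 if y == cls else 0.0)
                       for p, y in zip(probs, labels))
        n = len(pairs)
        for r in range(n_ranges):
            chunk = pairs[r * n // n_ranges:(r + 1) * n // n_ranges]
            if chunk:  # ranges are empty only when n < n_ranges
                total += _bin_gap([c for c, _ in chunk], [h for _, h in chunk])
                count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
    labels = [0, 1, 1, 0]
    print(ece(probs, labels), sce(probs, labels), ace(probs, labels, n_ranges=2))
```

Each row of `probs` is one prediction's class-probability list and `labels` holds the true class indices. The only structural change from `ece` to `sce` is the loop over classes; the change from `sce` to `ace` is replacing fixed-width bins with sorted, equally sized ranges.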

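Experimental Results

Research Questions

RQ1: In multiclass settings, how does the choice of calibration metric affect conclusions about model calibration?
RQ2: Do class-conditional calibration metrics give more reliable evaluations than aggregate, unconditional ones?
RQ3: How do adaptive binning, the choice of norm (L1 vs. L2), and thresholding affect calibration evaluation and the ranking of methods?
RQ4: How do recalibration techniques interact with different calibration metrics across datasets?
RQ5: What practical recommendations can improve the robustness and comparability of calibration evaluation?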

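Key Findings

ECE has several flaws (it ignores non-maximum probabilities, uses fixed bins, and lacks class conditionality) that distort calibration evaluation.
Class-conditional calibration metrics reveal that calibration is uneven across classes and give a more informative evaluation.
Adaptive binning (ACE) stabilizes the rank ordering of metrics as the number of bins varies and outperforms static binning in practice.
Using the L2 norm generally helps when optimizing for calibration metrics and improves rank-correlation consistency.
The ranking of recalibration methods varies widely across calibration metrics, so conclusions depend on the metric used.
Adaptive calibration metrics are more robust and reliable for comparisons across datasets and architectures.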
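
To make "recalibration" concrete, here is a minimal sketch of temperature scaling, one widely used recalibration technique. This page does not name the specific methods the paper compares, so treating temperature scaling as representative is an assumption, as are the function names and the simple grid search used in place of a gradient-based optimizer.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T > 1 softens the output)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on a grid that minimizes held-out NLL."""
    if grid is None:
        grid = [0.25 * i for i in range(1, 41)]  # hypothetical grid: 0.25 .. 10.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


if __name__ == "__main__":
    # Overconfident toy model: very peaked logits, but one of four is wrong.
    logits = [[5.0, 0.0], [5.0, 0.0], [5.0, 0.0], [0.0, 5.0]]
    labels = [0, 0, 1, 1]
    t = fit_temperature(logits, labels)
    print(t, softmax(logits[0], t))
```

The fitted temperature here minimizes negative log-likelihood. Swapping that objective for a calibration metric such as ECE, SCE, or ACE is the kind of "optimizing directly for each variant" the abstract describes, and different objectives can select different temperatures, which is one way the ranking of recalibration methods comes to depend on the metric.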

This review was generated by AI and checked by a human editor.