QUICK REVIEW

[Paper Review] On Calibration of Modern Neural Networks

Chuan Guo, Geoff Pleiss | arXiv (Cornell University) | Jun 14, 2017

Anomaly Detection Techniques and Applications | References: 37 | Cited by: 1,715

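One-Sentence Summary

This paper shows that modern neural networks are poorly calibrated, and that simple post-hoc temperature scaling often gives the best calibration on both vision and NLP tasks.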

ABSTRACT

Confidence calibration -- the problem of predicting probability estimates representative of the true correctness likelihood -- is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling -- a single-parameter variant of Platt Scaling -- is surprisingly effective at calibrating predictions.

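Motivation and Objectives

Study how well modern neural networks are calibrated across architectures and datasets.
Quantify the effect of depth, width, weight decay, and Batch Normalization on calibration.
Evaluate post-processing calibration methods and identify one that is both practical and effective.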

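Proposed Approach

Define calibration formally using reliability diagrams, Expected Calibration Error (ECE), and Maximum Calibration Error (MCE).
Analyze how architecture and training choices (depth, width, Batch Normalization, weight decay) affect calibration.
Compare calibration methods: histogram binning, isotonic regression, Bayesian Binning into Quantiles (BBQ), Platt scaling, temperature scaling, vector scaling, and matrix scaling.
Extend calibration methods from binary to multiclass classification (one-vs-all, vector/matrix scaling, temperature scaling).
Evaluate these methods on state-of-the-art architectures with image and document classification datasets.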
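As a rough illustration of the metrics named above: ECE is the sample-weighted average, over confidence bins, of the gap between accuracy and mean confidence in each bin, and MCE is the largest such gap. The sketch below assumes equal-width bins; the function name `calibration_errors` and the default `n_bins=15` are illustrative choices, not taken from the paper's code.

```python
import numpy as np

def calibration_errors(confidences, correct, n_bins=15):
    """Return (ECE, MCE) computed over equal-width confidence bins.

    confidences: predicted probability of the chosen class, one per sample.
    correct:     1 if the prediction was right, 0 otherwise.
    n_bins:      number of bins (an illustrative default).
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin b holds confidences in (edges[b], edges[b + 1]].
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    n = len(confidences)
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.sum() / n * gap   # weight the gap by the bin's share of samples
        mce = max(mce, gap)           # track the worst bin
    return ece, mce
```

A reliability diagram plots the same per-bin accuracy against per-bin confidence; a perfectly calibrated model lies on the diagonal and has ECE and MCE of zero.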

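Experimental Results

Research Questions

RQ1: How well calibrated are modern neural networks across architectures and datasets?
RQ2: Which architecture and training choices lead to poor calibration, and can post-processing methods correct it efficiently?
RQ3: In practice, is temperature scaling sufficient, or even better than more complex calibration methods?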

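Key Findings

Modern networks are often poorly calibrated: higher accuracy does not imply well-calibrated confidence.
Calibration quality is related to model capacity, Batch Normalization, and weight decay; higher capacity and Batch Normalization can worsen calibration, as can training with less weight decay.
Temperature scaling usually outperforms more complex calibration methods and is fast to compute.
Binning-based methods can improve calibration but usually fall short of temperature scaling; vector scaling performs similarly to temperature scaling.
Calibration behavior varies by dataset; Reuters is an exception, where temperature scaling is less effective.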
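To make the temperature scaling finding concrete, here is a minimal sketch: a single scalar T > 0 divides the logits before the softmax, and T is chosen to minimize negative log-likelihood on a held-out validation set. The grid search over candidate temperatures is a simplification assumed here for self-containment (any 1-D optimizer would do), and all function names and the candidate range are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the labels after scaling logits by 1/T."""
    log_probs = log_softmax(logits / temperature)
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels, candidates=None):
    """Pick the temperature with the lowest validation NLL.

    Assumption: a log-spaced grid search stands in for a proper optimizer.
    """
    if candidates is None:
        candidates = np.exp(np.linspace(np.log(0.05), np.log(20.0), 400))
    losses = [nll(val_logits, val_labels, t) for t in candidates]
    return float(candidates[int(np.argmin(losses))])

def calibrated_probs(logits, temperature):
    """Softmax probabilities computed from temperature-scaled logits."""
    return np.exp(log_softmax(logits / temperature))
```

Because dividing every logit by the same positive T does not change which class has the largest logit, predictions and accuracy stay the same; only the confidence values change. A fitted T above 1 softens overconfident probabilities.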


This review was generated by AI and reviewed by human editors.