QUICK REVIEW

[Paper Review] On Calibration of Modern Neural Networks

Chuan Guo, Geoff Pleiss | arXiv (Cornell University) | Jun 14, 2017

Anomaly Detection Techniques and Applications | References: 37 | Citations: 1,715

ONE-LINE SUMMARY

The paper shows that modern neural networks are poorly calibrated and that simple post-hoc temperature scaling often gives the best calibration across both vision and NLP tasks.

ABSTRACT

Confidence calibration -- the problem of predicting probability estimates representative of the true correctness likelihood -- is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling -- a single-parameter variant of Platt Scaling -- is surprisingly effective at calibrating predictions.
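The two central ideas in the abstract can be written compactly. The following is a sketch in notation that follows the paper's conventions as closely as recalled here, so treat the symbol names as assumptions: \hat{Y} is the predicted class, \hat{P} its reported confidence, \mathbf{z}_i the logit vector for input i, and T > 0 the single temperature parameter fitted on a validation set.

```latex
% Perfect calibration: reported confidence equals the probability of being correct
\mathbb{P}\left(\hat{Y} = Y \mid \hat{P} = p\right) = p, \quad \forall p \in [0, 1]

% Temperature scaling: one scalar T > 0 rescales the logits before the softmax
\hat{q}_i = \max_k \, \sigma_{\mathrm{SM}}\!\left(\mathbf{z}_i / T\right)^{(k)}
```

With T = 1 the original softmax confidence is recovered; T > 1 softens the distribution and lowers the reported confidence.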

MOTIVATION AND OBJECTIVES

Investigate calibration of modern neural networks across architectures and datasets.
Quantify how depth, width, weight decay, and Batch Normalization affect calibration.
Evaluate post-processing calibration methods and identify practical, effective approaches.

PROPOSED METHOD

Define calibration formally and measure miscalibration with reliability diagrams, Expected Calibration Error (ECE), and Maximum Calibration Error (MCE).
Analyze how architecture/training choices affect calibration (depth/width, BN, weight decay).
Compare calibration methods: histogram binning, isotonic regression, Bayesian Binning into Quantiles (BBQ), Platt scaling, temperature scaling, and vector/matrix scaling.
Extend calibration methods from binary to multiclass cases (one-vs-all, vector/matrix scaling, temperature scaling).
Evaluate methods on image and document classification datasets with state-of-the-art architectures.
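
The ECE and MCE metrics named in this list can be sketched in a few lines. This is a minimal pure-Python illustration, assuming equal-width confidence bins: ECE is the sample-weighted average of the per-bin gap between accuracy and mean confidence, and MCE is the largest such gap. The function name `calibration_errors`, its parameters, and the default of 10 bins are illustrative choices, not taken from the review or the paper.

```python
import math


def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) using equal-width confidence bins.

    confidences: predicted-class probabilities in [0, 1]
    correct:     truthy where the prediction was right, falsy otherwise
    n_bins:      number of bins (illustrative default, an assumption)
    """
    n = len(confidences)
    counts = [0] * n_bins        # samples per bin
    conf_sums = [0.0] * n_bins   # summed confidence per bin
    hits = [0] * n_bins          # correct predictions per bin

    for p, ok in zip(confidences, correct):
        # Bin m covers (m / n_bins, (m + 1) / n_bins]; clamp handles p == 0.
        # Values exactly on a bin edge may shift by one bin due to float rounding.
        m = min(max(math.ceil(p * n_bins) - 1, 0), n_bins - 1)
        counts[m] += 1
        conf_sums[m] += p
        hits[m] += 1 if ok else 0

    ece, mce = 0.0, 0.0
    for c, s, h in zip(counts, conf_sums, hits):
        if c == 0:
            continue  # empty bins contribute nothing
        gap = abs(h / c - s / c)   # |accuracy - mean confidence| in this bin
        ece += (c / n) * gap       # weight by the bin's share of samples
        mce = max(mce, gap)
    return ece, mce
```

The per-bin accuracy and mean confidence computed inside the loop are the same quantities a reliability diagram plots against each other, so this function could be adapted to return them for plotting.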

EXPERIMENTAL RESULTS

Research Questions

RQ1: How well calibrated are modern neural networks across architectures and datasets?
RQ2: Which architectural/training choices drive miscalibration, and can post-hoc methods rectify it efficiently?
RQ3: Is temperature scaling sufficient or superior to more complex calibration methods in practice?

Key Findings

Modern networks are often miscalibrated: higher accuracy does not imply well-calibrated confidences.
Calibration quality correlates with model capacity, Batch Normalization, and weight decay: greater depth and width, Batch Normalization, and weaker weight decay each tend to worsen calibration.
Temperature scaling commonly outperforms more complex calibration methods and is fast to compute.
Bin-based methods improve calibration but are typically outperformed by temperature scaling; vector scaling behaves similarly to temperature scaling.
Calibration performance varies by dataset; Reuters is an exception where temperature scaling is less effective.
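
To make the "fast to compute" point concrete, here is a minimal sketch of temperature scaling: a single scalar T > 0 divides the logits, and T is chosen to minimize negative log-likelihood (NLL) on held-out validation data. The golden-section search over log T, the search bounds, and the names `nll` and `fit_temperature` are assumptions made for this illustration; the review does not state which optimizer the authors used.

```python
import math


def nll(logits, labels, T):
    """Mean negative log-likelihood of softmax(logits / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        scaled = [v / T for v in z]
        m = max(scaled)  # subtract the max for a stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(v - m) for v in scaled))
        total += log_norm - scaled[y]  # equals -log softmax(scaled)[y]
    return total / len(labels)


def fit_temperature(logits, labels, lo=0.05, hi=20.0, iters=80):
    """Fit T by golden-section search over log T (optimizer is an assumption).

    The NLL is convex in 1/T, hence unimodal in log T, so a bracketing
    search converges as long as the optimum lies inside [lo, hi].
    """
    a, b = math.log(lo), math.log(hi)
    phi = (math.sqrt(5) - 1) / 2
    c = b - phi * (b - a)
    d = a + phi * (b - a)
    for _ in range(iters):
        if nll(logits, labels, math.exp(c)) < nll(logits, labels, math.exp(d)):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
        c = b - phi * (b - a)
        d = a + phi * (b - a)
    return math.exp((a + b) / 2)


# Toy validation set (made up for illustration): the model is very confident
# in class 0 on every sample but is right only three times out of four.
val_logits = [[4.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]
T = fit_temperature(val_logits, val_labels)
```

On this toy data the fitted T comes out at about 3.64, which pulls the class-0 confidence down from roughly 0.98 to the observed accuracy of 0.75. Because dividing all logits by the same positive T does not change their ordering, the predicted class, and therefore accuracy, is unaffected.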


This review was generated by AI and checked by a human editor.