[Paper Review] On Calibration of Modern Neural Networks
The paper shows that modern neural networks are poorly calibrated, and that simple post-hoc temperature scaling often achieves the best calibration across vision and NLP tasks.
Confidence calibration -- the problem of predicting probability estimates representative of the true correctness likelihood -- is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling -- a single-parameter variant of Platt Scaling -- is surprisingly effective at calibrating predictions.
Motivation and Objectives
- Investigate calibration of modern neural networks across architectures and datasets.
- Quantify how depth, width, weight decay, and Batch Normalization affect calibration.
- Evaluate post-processing calibration methods and identify practical, effective approaches.
Proposed Method
- Define calibration formally and measure it with reliability diagrams, Expected Calibration Error (ECE), and Maximum Calibration Error (MCE).
- Analyze how architecture/training choices affect calibration (depth/width, BN, weight decay).
- Compare calibration methods: histogram binning, isotonic regression, BBQ, Platt scaling, temperature scaling, vector/matrix scaling.
- Extend calibration methods from binary to multiclass cases (one-vs-all, vector/matrix scaling, temperature scaling).
- Evaluate methods on image and document classification datasets with state-of-the-art architectures.
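To make the two scalar metrics named above concrete, here is a minimal sketch of how ECE and MCE could be computed from a model's top-label confidences. It assumes the usual binned definitions: predictions are grouped into equal-width confidence bins, ECE is the sample-weighted average of the per-bin gap between accuracy and mean confidence, and MCE is the largest such gap. The function name `calibration_errors`, its parameters, and the default of 15 bins are choices made for this illustration and are not taken from the review.

```python
import math
from typing import Sequence, Tuple


def calibration_errors(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 15,
) -> Tuple[float, float]:
    """Return (ECE, MCE) from top-label confidences and correctness flags."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("at least one prediction is required")

    conf_sum = [0.0] * n_bins  # sum of confidences falling in each bin
    hit_sum = [0] * n_bins     # number of correct predictions in each bin
    count = [0] * n_bins       # number of predictions in each bin

    for p, hit in zip(confidences, correct):
        # Bin m (0-based) covers the interval (m / n_bins, (m + 1) / n_bins].
        idx = min(n_bins - 1, max(0, math.ceil(p * n_bins) - 1))
        conf_sum[idx] += p
        hit_sum[idx] += int(hit)
        count[idx] += 1

    n = len(confidences)
    ece, mce = 0.0, 0.0
    for m in range(n_bins):
        if count[m] == 0:
            continue  # empty bins contribute nothing to either metric
        accuracy = hit_sum[m] / count[m]
        mean_confidence = conf_sum[m] / count[m]
        gap = abs(accuracy - mean_confidence)
        ece += (count[m] / n) * gap  # weight each gap by the bin's share of samples
        mce = max(mce, gap)          # track the worst-case bin
    return ece, mce
```

A reliability diagram plots the same per-bin accuracy against per-bin confidence, so the `accuracy` and `mean_confidence` values computed in the loop are the quantities such a plot would show. Confidences that land exactly on a bin edge may be assigned to the neighboring bin because of floating-point rounding, which is usually negligible on real data.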
Experimental Results
Research Questions
- RQ1: How well calibrated are modern neural networks across architectures and datasets?
- RQ2: Which architectural/training choices drive miscalibration, and can post-hoc methods rectify it efficiently?
- RQ3: Is temperature scaling sufficient or superior to more complex calibration methods in practice?
Key Findings
- Modern networks are often miscalibrated: higher accuracy does not imply well-calibrated confidences.
- Miscalibration grows with model capacity (depth and width) and with Batch Normalization, and training with less weight decay also worsens calibration.
- Temperature scaling commonly outperforms more complex calibration methods and is fast to compute.
- Bin-based methods improve calibration but are typically outperformed by temperature scaling; vector scaling behaves similarly to temperature scaling.
- Calibration performance varies by dataset; Reuters is an exception where temperature scaling is less effective.
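Since temperature scaling is central to these findings, a small sketch may help show how little machinery it needs. The idea is to divide every logit vector by a single scalar T > 0 before the softmax and to choose T by minimizing the negative log-likelihood (NLL) on a held-out validation set. Because the same T is applied to every class, the predicted label never changes, only the confidence. The function names, the search bounds, and the use of golden-section search over log T are assumptions made for this illustration; they are one reasonable way to solve the one-dimensional problem, not necessarily the optimizer used in the paper.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(
    logits_batch: Sequence[Sequence[float]],
    labels: Sequence[int],
    temperature: float,
) -> float:
    """Average negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        scaled = [z / temperature for z in logits]
        top = max(scaled)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)


def fit_temperature(
    logits_batch: Sequence[Sequence[float]],
    labels: Sequence[int],
    t_min: float = 0.05,  # hypothetical search bounds chosen for this sketch
    t_max: float = 20.0,
    iters: int = 80,
) -> float:
    """Golden-section search over log T for the temperature minimizing the NLL."""
    def objective(log_t: float) -> float:
        return mean_nll(logits_batch, labels, math.exp(log_t))

    a, b = math.log(t_min), math.log(t_max)
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = objective(c), objective(d)
    for _ in range(iters):
        if fc < fd:
            # Minimum lies in [a, d]; the old c becomes the new right probe.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = objective(c)
        else:
            # Minimum lies in [c, b]; the old d becomes the new left probe.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = objective(d)
    return math.exp((a + b) / 2.0)
```

In use, `fit_temperature` would be called once on validation logits and labels, and the returned T would then be passed to `softmax` at test time. A fitted T above 1 softens overconfident predictions. If every validation prediction is already correct, the NLL keeps decreasing as T shrinks, so the search simply returns a value near `t_min`.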
This review was generated by AI and checked by a human editor.