QUICK REVIEW

[Paper Review] Quality of Uncertainty Quantification for Bayesian Neural Network Inference

Jiayu Yao, Weiwei Pan|arXiv (Cornell University)|Jun 24, 2019

Gaussian Processes and Bayesian Inference | References: 19 | Citations: 74

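One-line summary

This paper compares 10 inference methods for Bayesian neural networks and evaluates the quality of their uncertainty quantification. Common metrics such as test log-likelihood can be misleading, and methods designed to capture richer posterior structure do not necessarily yield better posterior approximations.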

ABSTRACT

Bayesian Neural Networks (BNNs) place priors over the parameters in a neural network. Inference in BNNs, however, is difficult; all inference methods for BNNs are approximate. In this work, we empirically compare the quality of predictive uncertainty estimates for 10 common inference methods on both regression and classification tasks. Our experiments demonstrate that commonly used metrics (e.g. test log-likelihood) can be misleading. Our experiments also indicate that inference innovations designed to capture structure in the posterior do not necessarily produce high quality posterior approximations.

Motivation and objectives

Motivate robust evaluation of uncertainty in Bayesian neural networks beyond standard predictive metrics.
Compare a broad range of approximate inference methods on regression and classification tasks.
Investigate how well different methods approximate the true posterior and how that relates to predictive uncertainty.
Provide guidance on when common uncertainty metrics are reliable or misleading.

Proposed approach

Evaluate 10 inference methods (BBB, PBP, BB-ALPHA, MNF, MVG, BBH, Dropout, Ensemble, SGLD, SGHMC) against ground-truth HMC.
Create synthetic, ground-truth-like datasets where posterior predictive uncertainty can be meaningfully assessed.
Use fixed priors and neural networks (1 hidden layer for regression, 2 hidden layers for classification) and optimize with Adam (except HMC/SGLD/SGHMC).
Assess posterior predictive quality via multiple metrics including RMSE, test marginal log-likelihood (LogLL), Prediction Interval Coverage Probability (PICP), and Mean Prediction Interval Width (MPIW).
Argue that log-likelihood and calibration metrics can be poor proxies for posterior fidelity and illustrate with ground-truth-like experiments.
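The four metrics listed above can be computed from posterior predictive samples roughly as follows. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the function name `predictive_metrics`, the known Gaussian noise level `noise_std`, the 95% central interval, and averaging the log-likelihood over test points are all choices made here for illustration, and the paper's exact definitions may differ.

```python
import numpy as np


def predictive_metrics(y_true, f_samples, noise_std, level=0.95, seed=0):
    """Summarize a sampled posterior predictive on regression test data.

    y_true:    (N,) observed test targets
    f_samples: (S, N) network outputs, one row per posterior weight sample
    noise_std: Gaussian observation noise, assumed known (hypothetical parameter)
    """
    y_true = np.asarray(y_true, dtype=float)
    f_samples = np.asarray(f_samples, dtype=float)
    rng = np.random.default_rng(seed)

    # RMSE of the posterior predictive mean.
    rmse = np.sqrt(np.mean((f_samples.mean(axis=0) - y_true) ** 2))

    # Test log-likelihood: mean over test points of
    # log (1/S) sum_s N(y_n | f_s(x_n), noise_std^2), via a stable log-sum-exp.
    log_pdf = (-0.5 * np.log(2.0 * np.pi * noise_std**2)
               - 0.5 * ((y_true - f_samples) / noise_std) ** 2)
    m = log_pdf.max(axis=0)
    log_ll = np.mean(m + np.log(np.mean(np.exp(log_pdf - m), axis=0)))

    # Central prediction interval from noisy predictive draws.
    y_samples = f_samples + noise_std * rng.standard_normal(f_samples.shape)
    alpha = 1.0 - level
    lower, upper = np.quantile(y_samples, [alpha / 2, 1 - alpha / 2], axis=0)

    # PICP: fraction of targets inside the interval. MPIW: mean interval width.
    picp = np.mean((y_true >= lower) & (y_true <= upper))
    mpiw = np.mean(upper - lower)
    return {"RMSE": rmse, "LogLL": log_ll, "PICP": picp, "MPIW": mpiw}


# Usage on synthetic data where the predictive is exactly right:
# targets are N(0, 1) and every posterior sample predicts 0 with unit noise.
demo_rng = np.random.default_rng(1)
y_demo = demo_rng.standard_normal(2000)
f_demo = np.zeros((1000, 2000))
metrics = predictive_metrics(y_demo, f_demo, noise_std=1.0)
```

In this well-specified toy case PICP should land near the nominal level and MPIW near the width of a 95% standard-normal interval. A method can still match both while having the wrong posterior shape, which is the concern the review raises about these metrics.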
Experimental results

Research questions

RQ1: How do different Bayesian neural network inference methods compare in terms of predictive uncertainty quality?
RQ2: Do common uncertainty metrics reliably reflect fidelity to the true posterior across tasks and data regimes?
RQ3: Does incorporating posterior structure via advanced variational families or ensembles translate to better posterior approximations in practice?

Key findings

Method	RMSE	LogLL	PICP	MPIW
HMC	0.85 ± 0.01	-1.40 ± 0.28	0.86 ± 0.00	1.79 ± 0.02
BBB	2.33 ± 0.11	-41.12 ± 6.23	0.46 ± 0.04	1.67 ± 0.04
PBP	2.92 ± 0.29	-106.78 ± 19.64	0.32 ± 0.07	0.81 ± 0.00
BB-α	1.86 ± 1.65	-5.41 ± 2.82	0.78 ± 0.09	6.57 ± 12.74
MVG	1.70 ± 0.37	-26.69 ± 12.18	0.64 ± 0.03	1.47 ± 0.20
MNF	1.11 ± 0.21	-13.85 ± 6.87	0.63 ± 0.04	0.92 ± 0.02
BBH	1.32 ± 0.22	-12.55 ± 8.23	0.65 ± 0.03	1.21 ± 0.19
Dropout	1.45 ± 0.17	-5.99 ± 1.82	0.60 ± 0.04	1.64 ± 0.18
Ensemble	0.90 ± 0.01	-6.65 ± 0.75	0.84 ± 0.00	1.50 ± 0.05
SGLD	0.86 ± 0.08	-3.60 ± 0.75	0.75 ± 0.02	2.62 ± 0.04
SGHMC	1.18 ± 0.01	-1.27 ± 0.26	0.86 ± 0.00

Test log-likelihood and calibration metrics can be misleading indicators of posterior fidelity: a method can score well on them while approximating the true posterior poorly.
Some methods that capture posterior structure do not consistently yield better approximations of the ground-truth posterior.
SGHMC tends to produce posterior predictives most similar to HMC, while SGLD often underestimates uncertainty.
Ensembles can give unreliable uncertainty estimates if model diversity is not appropriately encouraged.
Methods with richer divergences or structured variational families do not universally outperform simpler approaches in these experiments.
Across tasks, many approximate methods produce predictive distributions that understate uncertainty in regions undersampled by data.
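As a rough illustration of how metric choice changes the picture, the sketch below ranks the approximate methods using only the mean values from the table above (the ± spreads are ignored). Using the gap between a method's PICP and the HMC reference as a crude closeness measure is an assumption of this sketch, not the paper's analysis, and the variable names are hypothetical.

```python
# Mean LogLL and PICP per method, copied from the results table.
table = {
    "HMC": (-1.40, 0.86),
    "BBB": (-41.12, 0.46),
    "PBP": (-106.78, 0.32),
    "BB-alpha": (-5.41, 0.78),
    "MVG": (-26.69, 0.64),
    "MNF": (-13.85, 0.63),
    "BBH": (-12.55, 0.65),
    "Dropout": (-5.99, 0.60),
    "Ensemble": (-6.65, 0.84),
    "SGLD": (-3.60, 0.75),
    "SGHMC": (-1.27, 0.86),
}

hmc_picp = table["HMC"][1]
approx = [name for name in table if name != "HMC"]

# Ranking 1: highest test log-likelihood first.
by_logll = sorted(approx, key=lambda name: table[name][0], reverse=True)

# Ranking 2: smallest absolute PICP gap to the HMC reference first.
by_picp_gap = sorted(approx, key=lambda name: abs(table[name][1] - hmc_picp))

for name in approx:
    print(f"{name:9s} LogLL rank {by_logll.index(name) + 1:2d}   "
          f"PICP-gap rank {by_picp_gap.index(name) + 1:2d}")
```

On these numbers the two rankings agree at the top and bottom but disagree in the middle: Dropout, for example, places much higher by log-likelihood than by coverage gap, while Ensemble moves in the opposite direction.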

This review was generated by AI and checked by a human editor.