[Paper Review] On the Importance of Strong Baselines in Bayesian Deep Learning
This paper demonstrates that Monte Carlo dropout, a widely used baseline in Bayesian deep learning, matches or outperforms state-of-the-art methods when evaluated under identical training conditions. The authors expose a critical flaw in prior benchmarking practice, namely comparing models trained to convergence against baselines trained for only 40 epochs, and show that a properly trained baseline undermines the claims of superiority made in several recent works.
Like all sub-fields of machine learning, Bayesian Deep Learning is driven by empirical validation of its theoretical proposals. Given the many aspects of an experiment, it is always possible that minor or even major experimental flaws can slip by both authors and reviewers. One of the most popular experiments used to evaluate approximate inference techniques is the regression experiment on UCI datasets. However, in this experiment, models which have been trained to convergence have often been compared with baselines trained only for a fixed number of iterations. We find that a well-established baseline, Monte Carlo dropout, when evaluated under the same experimental settings, shows significant improvements. In fact, the baseline outperforms or performs competitively with methods that claimed to be superior to the very same baseline method when they were introduced. Hence, by exposing this flaw in experimental procedure, we highlight the importance of using identical experimental setups to evaluate, compare, and benchmark methods in Bayesian Deep Learning.
Motivation & Objective
- To investigate the impact of inconsistent experimental settings on the evaluation of Bayesian deep learning methods.
- To identify and correct a common flaw in benchmarking: comparing models trained to convergence against baselines trained for only 40 epochs.
- To demonstrate that well-tuned Monte Carlo dropout, a standard baseline, performs competitively with or better than claimed SOTA methods when evaluated under the same conditions.
- To advocate for rigorous, consistent experimental setups in Bayesian deep learning research to ensure valid comparisons and reliable claims of improvement.
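The flaw named in the list above, a fixed training budget for one method versus training to convergence for another, can be illustrated with a toy sketch. This is not the paper's experiment: the linear model, synthetic data, learning rate, and patience-based stopping rule are all assumptions chosen only to show how the same model scores very differently under the two protocols.

```python
import numpy as np

def train(x_tr, y_tr, x_val, y_val, max_epochs, patience=None, lr=0.01):
    """Full-batch gradient descent on linear regression (toy stand-in model).

    patience=None  -> stop after exactly max_epochs (fixed budget).
    patience=k     -> stop once validation MSE has not improved for k epochs.
    Returns (best validation MSE seen, number of epochs actually run).
    """
    w = np.zeros(x_tr.shape[1])
    best, since_best, epoch = np.inf, 0, 0
    for epoch in range(1, max_epochs + 1):
        grad = 2.0 * x_tr.T @ (x_tr @ w - y_tr) / len(y_tr)
        w -= lr * grad
        val = float(np.mean((x_val @ w - y_val) ** 2))
        if val < best:
            best, since_best = val, 0
        else:
            since_best += 1
        if patience is not None and since_best >= patience:
            break
    return best, epoch

# Synthetic regression data (hypothetical, not a UCI dataset).
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
x_tr, x_val = rng.normal(size=(200, 5)), rng.normal(size=(100, 5))
y_tr = x_tr @ w_true + 0.1 * rng.normal(size=200)
y_val = x_val @ w_true + 0.1 * rng.normal(size=100)

# Same model, same data, two training protocols.
fixed_mse, fixed_epochs = train(x_tr, y_tr, x_val, y_val, max_epochs=40)
conv_mse, conv_epochs = train(x_tr, y_tr, x_val, y_val, max_epochs=5000, patience=20)

print(f"fixed budget : {fixed_epochs} epochs, val MSE {fixed_mse:.4f}")
print(f"to convergence: {conv_epochs} epochs, val MSE {conv_mse:.4f}")
```

Reporting the first number for a baseline and the second for a new method would credit the new method with a gain that comes from the training budget alone.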
Proposed method
- Re-evaluated regression experiments on UCI datasets using the same experimental protocol as recent SOTA methods, including training to convergence.
- Trained Monte Carlo dropout models under the same hyperparameters and training duration as the methods being compared.
- Used standard evaluation metrics: RMSE and predictive log-likelihood on test sets.
- Re-implemented and retrained baseline models (e.g., VMG, HS-BNN, PBP-MV, SGHMC) under the convergence setting for fair comparison.
- Performed hyperparameter tuning for MC dropout across all datasets to ensure optimal performance.
- Compared results directly with published values from original papers to isolate the effect of training duration and setup.
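The evaluation steps above (stochastic forward passes with dropout left on, then RMSE and predictive log-likelihood on the test set) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the one-hidden-layer network, the random weights, the function names, and the precision parameter `tau` are assumptions. The log-likelihood treats the `T` sampled predictions as an equal-weight Gaussian mixture with shared precision `tau`, a form commonly used for MC dropout regression.

```python
import numpy as np

def mc_dropout_samples(x, w1, b1, w2, b2, p_drop, n_samples, rng):
    """Run n_samples stochastic forward passes with dropout active at test time."""
    preds = []
    for _ in range(n_samples):
        h = np.maximum(x @ w1 + b1, 0.0)          # ReLU hidden layer
        mask = rng.random(h.shape) >= p_drop       # fresh Bernoulli mask per pass
        h = h * mask / (1.0 - p_drop)              # inverted-dropout rescaling
        preds.append(h @ w2 + b2)
    return np.stack(preds)                         # shape (n_samples, n_points)

def rmse(y_true, samples):
    """RMSE of the Monte Carlo mean prediction."""
    return float(np.sqrt(np.mean((y_true - samples.mean(axis=0)) ** 2)))

def predictive_log_likelihood(y_true, samples, tau):
    """Mean per-point log-likelihood under a mixture of T Gaussians with precision tau."""
    t = samples.shape[0]
    log_terms = -0.5 * tau * (y_true[None, :] - samples) ** 2   # (T, N)
    m = log_terms.max(axis=0)                                    # stable log-sum-exp
    lse = m + np.log(np.exp(log_terms - m).sum(axis=0))
    ll = lse - np.log(t) - 0.5 * np.log(2.0 * np.pi) + 0.5 * np.log(tau)
    return float(ll.mean())

# Toy inputs and untrained random weights (hypothetical, for shape checking only).
rng = np.random.default_rng(0)
x, y = rng.normal(size=(8, 3)), rng.normal(size=8)
w1, b1 = rng.normal(size=(3, 16)), np.zeros(16)
w2, b2 = rng.normal(size=16), 0.0

samples = mc_dropout_samples(x, w1, b1, w2, b2, p_drop=0.05, n_samples=100, rng=rng)
print("RMSE:", rmse(y, samples))
print("log-likelihood:", predictive_log_likelihood(y, samples, tau=1.0))
```

In a real comparison, the dropout rate and `tau` would be among the hyperparameters tuned per dataset, and the same functions would be applied to every method's test predictions so that the metrics are computed identically.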
Experimental results
Research questions
- RQ1: Does training to convergence significantly improve the performance of Monte Carlo dropout compared to fixed-epoch training?
- RQ2: How do the performance rankings of Bayesian deep learning methods change when evaluated under identical experimental conditions?
- RQ3: To what extent do prior claims of SOTA performance rely on unfair comparisons with undertrained baselines?
- RQ4: Can a standard baseline like MC dropout outperform more complex methods when both are trained under the same conditions?
- RQ5: What is the impact of inconsistent training protocols on the validity of empirical claims in Bayesian deep learning research?
Key findings
- Monte Carlo dropout, when trained to convergence, achieves state-of-the-art or near-state-of-the-art performance on multiple UCI regression datasets.
- On the Boston Housing, Concrete Strength, and Wine Quality Red datasets, MC dropout achieved the best log-likelihood scores, outperforming VMG, HS-BNN, and SGHMC.
- In RMSE, MC dropout outperformed VMG, HS-BNN, and SGHMC on Concrete Strength, Naval Propulsion Plants, Wine Quality Red, and Yacht Hydrodynamics.
- On the Energy Efficiency and Kin8nm datasets, MC dropout achieved the best or second-best performance, with hyperparameter-tuned versions achieving the lowest RMSE.
- The Naval Propulsion Plants dataset showed MC dropout achieving near-perfect performance (RMSE ≈ 0.00), outperforming all other methods.
- The results indicate that prior claims of superiority over MC dropout for methods like VMG, HS-BNN, and SGHMC rested on comparisons with an undertrained baseline and do not hold up under identical training conditions.
This review was created by AI and reviewed by human editors.