QUICK REVIEW
[Paper Review] Yes, but Did It Work?: Evaluating Variational Inference
Yuling Yao, Aki Vehtari|arXiv (Cornell University)|Feb 7, 2018
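Computational and Text Analysis Methods | Cited by 67
One-Sentence Summary
This paper proposes two diagnostics for variational inference (VI): PSIS, which assesses the quality of the joint posterior approximation and corrects estimates, and VSBC, which assesses the average calibration of VI point estimates. It also gives guidance on reparameterization and practical thresholds.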
ABSTRACT
While it's always possible to compute a variational approximation to a posterior distribution, it can be difficult to discover problems with this approximation. We propose two diagnostic algorithms to alleviate this problem. The Pareto-smoothed importance sampling (PSIS) diagnostic gives a goodness of fit measurement for joint distributions, while simultaneously improving the error in the estimate. The variational simulation-based calibration (VSBC) assesses the average performance of point estimates.
Research Motivation and Objectives
- Motivate the need to diagnose variational approximations beyond ELBO optimization.
- Introduce two diagnostics to evaluate VI: PSIS for joint posterior quality and VSBC for average point-estimate calibration.
- Provide practical guidance on interpretation, thresholds, and reparameterization.
- Demonstrate diagnostics across representative Bayesian models and VI settings.
Proposed Methods
- Propose PSIS (Pareto-smoothed importance sampling) to diagnose the quality of the VI approximation by inspecting the Pareto tail shape parameter k and using it to adjust estimates.
- Use PSIS as a diagnostic that also yields a stabilized estimator for expectations: a self-normalized weighted sum over draws from q, using the smoothed importance weights (Equation (3) in the paper).
- Introduce VSBC (variational simulation-based calibration) to assess the average calibration of VI-derived point estimates: draw parameters from the prior, simulate data from the model, fit VI to each simulated data set, and evaluate the resulting calibration probabilities.
- Discuss invariance of k under reparameterization and its relation to Rényi divergence between p and q.
- Examine when reparameterization improves VI via PSIS diagnostics and provide practical examples with common VI setups (e.g., ADVI, hierarchical models, logistic/linear regression).
- Outline limitations of each diagnostic and their complementary nature.
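The weighted-sum estimator mentioned above can be illustrated with a minimal sketch of plain self-normalized importance sampling. This is not the paper's implementation: the Pareto smoothing of the largest weights is omitted here, and the toy target p = N(1, 1), the approximation q = N(0.8, 1.2), and the function names are all assumptions made for illustration.

```python
import math
import random


def log_normal_pdf(x, mean, sd):
    """Log density of a univariate normal distribution."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def snis_estimate(draws, log_p, log_q, h):
    """Self-normalized importance sampling estimate of E_p[h(theta)].

    `draws` are samples from q. `log_p` may be unnormalized, for example
    log p(theta, y), because the normalizing constant cancels in the ratio.
    No Pareto smoothing is applied in this sketch.
    """
    log_r = [log_p(t) - log_q(t) for t in draws]
    max_log_r = max(log_r)  # subtract the max for numerical stability
    weights = [math.exp(v - max_log_r) for v in log_r]
    total = sum(weights)
    return sum(w * h(t) for w, t in zip(weights, draws)) / total


# Toy setup (assumed): the approximation q is slightly shifted and too wide.
rng = random.Random(0)
draws = [rng.gauss(0.8, 1.2) for _ in range(20000)]

# Plain estimate of the mean, using q alone.
plain_mean = sum(draws) / len(draws)

# Importance-weighted estimate of the mean under p.
corrected_mean = snis_estimate(
    draws,
    log_p=lambda t: log_normal_pdf(t, 1.0, 1.0),
    log_q=lambda t: log_normal_pdf(t, 0.8, 1.2),
    h=lambda t: t,
)
```

In this toy case q has heavier tails than p, so the raw ratios are bounded and the correction is well behaved. The situation PSIS targets is the opposite one, where a few very large ratios dominate the sum; there the largest weights would be replaced by smoothed values before the weighted sum is formed.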
Experimental Results
Research Questions
- RQ1: Can PSIS diagnostics quantify the discrepancy between the VI posterior q(θ) and the true posterior p(θ|y) for a given data set?
- RQ2: Can VSBC diagnostics assess whether the VI-derived point estimates are calibrated on average across data generated from the model?
- RQ3: How do reparameterization and model structure affect the reliability of VI as diagnosed by PSIS and VSBC?
- RQ4: What practical thresholds on the Pareto k and VSBC outcomes indicate reliable VI versus the need for tuning or MCMC?
- RQ5: How do these diagnostics perform across linear, logistic, hierarchical, and high-dimensional models?
Key Findings
- PSIS provides a diagnostic shape parameter k that quantifies VI quality; small k (<0.5) suggests reliable PSIS convergence and q close to p, while larger k (>0.7) warns of unreliable VI and need for tuning or MCMC.
- PSIS-adjusted estimates (via weighted averages with smoothed weights) can reduce bias and variance compared to plain VI or plain IS, improving finite-sample performance.
- VSBC assesses average calibration of VI point estimates by testing symmetry of marginal calibration probabilities; findings can reveal bias in individual margins even when point estimates appear reasonable on average.
- Reparameterization can substantially alter VI quality, and PSIS can guide the choice of parametrization to reduce k and improve fit (e.g., non-centered parametrization in Eight-School example).
- VSBC distinguishes average unbiasedness from dataset-specific performance, highlighting that good performance on average does not guarantee accuracy for a particular realization, and vice versa.
- Applications demonstrate PSIS and VSBC across linear and logistic regression, hierarchical models, and high-dimensional cancer classification with horseshoe priors, showing where VI succeeds or fails and how diagnostics inform adjustments.
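The VSBC idea described in the findings can be sketched on a toy conjugate model. Everything here is an illustrative assumption rather than the paper's code: the model (θ ~ N(0, 1), y_i ~ N(θ, 1)), the `biased_approximation` function standing in for a VI fit with a shifted mean, and the z-score of the mean calibration probability, which is only a crude stand-in for a formal symmetry test.

```python
import math
import random
from statistics import NormalDist


def vsbc_probabilities(fit_q, n_reps=500, n_obs=10, seed=1):
    """Return calibration probabilities p_j = Pr_q(theta < theta0_j)."""
    rng = random.Random(seed)
    probs = []
    for _ in range(n_reps):
        theta0 = rng.gauss(0.0, 1.0)                        # draw from the prior
        y = [rng.gauss(theta0, 1.0) for _ in range(n_obs)]  # simulate data
        q = fit_q(y)                                        # approximate posterior
        probs.append(q.cdf(theta0))
    return probs


def exact_posterior(y):
    """Conjugate posterior for theta ~ N(0, 1), y_i ~ N(theta, 1)."""
    n = len(y)
    return NormalDist(sum(y) / (n + 1), math.sqrt(1.0 / (n + 1)))


def biased_approximation(y, shift=0.2):
    """Hypothetical stand-in for a VI fit whose mean is shifted upward."""
    post = exact_posterior(y)
    return NormalDist(post.mean + shift, post.stdev)


def asymmetry_z(probs):
    """z-score of mean(p) against 0.5, using Var(Uniform(0, 1)) = 1/12."""
    m = len(probs)
    mean_p = sum(probs) / m
    return (mean_p - 0.5) / math.sqrt(1.0 / (12.0 * m))


probs_exact = vsbc_probabilities(exact_posterior)
probs_biased = vsbc_probabilities(biased_approximation)
z_exact = asymmetry_z(probs_exact)    # should be near 0
z_biased = asymmetry_z(probs_biased)  # strongly negative: q sits too high
```

With the exact posterior the calibration probabilities are uniform on (0, 1), so their histogram is symmetric around 0.5. When the approximation's mean is systematically too high, most of the probabilities fall below 0.5 and the asymmetry shows up across replications, even though no single data set would reveal it on its own.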
This review was generated by AI and checked by a human editor.