QUICK REVIEW

[Paper Review] Yes, but Did It Work?: Evaluating Variational Inference

Yuling Yao, Aki Vehtari|arXiv (Cornell University)|Feb 7, 2018
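Computational and Text Analysis Methods | Cited by 67
One-Sentence Summary

This paper proposes two diagnostics for variational inference (VI): PSIS, which assesses the quality of the joint posterior approximation and corrects the resulting estimates, and VSBC, which assesses the average calibration of VI point estimates. It also gives guidance on reparameterization and practical thresholds.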

ABSTRACT

While it's always possible to compute a variational approximation to a posterior distribution, it can be difficult to discover problems with this approximation. We propose two diagnostic algorithms to alleviate this problem. The Pareto-smoothed importance sampling (PSIS) diagnostic gives a goodness of fit measurement for joint distributions, while simultaneously improving the error in the estimate. The variational simulation-based calibration (VSBC) assesses the average performance of point estimates.

Motivation and Objectives

  • Motivate the need to diagnose variational approximations beyond ELBO optimization.
  • Introduce two diagnostics to evaluate VI: PSIS for joint posterior quality and VSBC for average point-estimate calibration.
  • Provide practical guidance on interpretation, thresholds, and reparameterization.
  • Demonstrate diagnostics across representative Bayesian models and VI settings.

Proposed Methods

  • Propose PSIS (Pareto-smoothed importance sampling) to diagnose the quality of the VI approximation by inspecting the estimated Pareto tail shape parameter k of the importance ratios p/q.
  • Use the same smoothed importance weights to form a stabilized, self-normalized weighted-sum estimator for posterior expectations (Equation (3) of the paper).
  • Introduce VSBC (variational simulation-based calibration) to assess the average calibration of VI-derived point estimates by simulating data from the prior and evaluating calibration probabilities.
  • Discuss invariance of k under reparameterization and its relation to Rényi divergence between p and q.
  • Examine when reparameterization improves VI via PSIS diagnostics and provide practical examples with common VI setups (e.g., ADVI, hierarchical models, logistic/linear regression).
  • Outline limitations of each diagnostic and their complementary nature.
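
The PSIS steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the tail shape is estimated with a Hill estimator, swapped in here for the generalized Pareto fit that PSIS uses, and the toy target (a standard normal p approximated by a zero-mean Gaussian q with scale `q_scale`) and the names `psis_estimate`, `toy_run`, and `interpret_k` are hypothetical. The 0.5 and 0.7 cut-offs are the ones reported in the findings; the label for the band between them is an assumption of this sketch.

```python
import math
import random


def psis_estimate(log_ratios, values):
    """Return (weighted estimate of E_p[h], tail shape estimate k_hat).

    log_ratios[s] = log p(theta_s, y) - log q(theta_s) for draws theta_s ~ q,
    values[s] = h(theta_s). Additive constants in log_ratios cancel out.
    """
    n = len(log_ratios)
    # Tail size rule from the PSIS literature: M = min(S/5, 3*sqrt(S)).
    m = min(n // 5, int(3 * math.sqrt(n)))
    order = sorted(range(n), key=lambda i: log_ratios[i])
    log_threshold = log_ratios[order[-m - 1]]
    tail = order[-m:]  # indices of the M largest ratios, in ascending order

    # Hill estimator of the tail shape (assumption: stands in for the
    # generalized Pareto fit used by PSIS).
    k_hat = sum(log_ratios[i] - log_threshold for i in tail) / m

    # Smooth the tail: replace each tail ratio by a quantile of the fitted
    # Pareto tail, then cap it at the largest raw ratio.
    log_max = log_ratios[order[-1]]
    smoothed = list(log_ratios)
    for rank, i in enumerate(tail, start=1):
        p = (rank - 0.5) / m
        smoothed[i] = min(log_threshold - k_hat * math.log1p(-p), log_max)

    # Self-normalized weighted sum, computed stably in log space.
    shift = max(smoothed)
    weights = [math.exp(lr - shift) for lr in smoothed]
    estimate = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return estimate, k_hat


def interpret_k(k_hat):
    """Map k_hat to a label using the 0.5 and 0.7 thresholds."""
    if k_hat < 0.5:
        return "reliable"
    if k_hat <= 0.7:
        return "borderline"  # label chosen for this sketch
    return "unreliable"


def toy_run(q_scale, n_draws=4000, seed=0):
    """Estimate E_p[theta^2] = 1 for p = N(0, 1) using q = N(0, q_scale^2)."""
    rng = random.Random(seed)
    draws = [rng.gauss(0.0, q_scale) for _ in range(n_draws)]
    # log p minus log q, with normalizing constants dropped.
    log_ratios = [-0.5 * x * x + 0.5 * (x / q_scale) ** 2 for x in draws]
    values = [x * x for x in draws]
    plain = sum(values) / n_draws  # uncorrected estimate taken under q
    corrected, k_hat = psis_estimate(log_ratios, values)
    return plain, corrected, k_hat
```

For example, `toy_run(0.9)` and `toy_run(0.5)` compare a mildly and a strongly underdispersed q. The narrower q should produce the larger k estimate, because its importance ratios have a heavier right tail.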

Experimental Results

Research Questions

  • RQ1: Can PSIS diagnostics quantify the discrepancy between the VI posterior q(θ) and the true posterior p(θ|y) for a given data set?
  • RQ2: Can VSBC diagnostics assess whether the VI-derived point estimates are calibrated on average across data generated from the model?
  • RQ3: How do reparameterization and model structure affect the reliability of VI as diagnosed by PSIS and VSBC?
  • RQ4: What practical thresholds on the Pareto k and VSBC outcomes indicate reliable VI versus the need for tuning or MCMC?
  • RQ5: How do these diagnostics perform across linear, logistic, hierarchical, and high-dimensional models?

Key Findings

  • PSIS provides an estimated shape parameter k that quantifies VI quality: k < 0.5 suggests fast PSIS convergence and q close to p; 0.5 < k < 0.7 indicates an approximation that is imperfect but still practically useful; k > 0.7 warns that VI is unreliable and calls for tuning or MCMC.
  • PSIS-adjusted estimates (via weighted averages with smoothed weights) can reduce bias and variance compared to plain VI or plain IS, improving finite-sample performance.
  • VSBC assesses average calibration of VI point estimates by testing symmetry of marginal calibration probabilities; findings can reveal bias in individual margins even when point estimates appear reasonable on average.
  • Reparameterization can substantially alter VI quality, and PSIS can guide the choice of parameterization to reduce k and improve fit (e.g., the non-centered parameterization in the eight-schools example).
  • VSBC distinguishes average unbiasedness from dataset-specific performance, highlighting that good performance on average does not guarantee accuracy for a particular realization, and vice versa.
  • Applications demonstrate PSIS and VSBC across linear and logistic regression, hierarchical models, and high-dimensional cancer classification with horseshoe priors, showing where VI succeeds or fails and how diagnostics inform adjustments.
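
The symmetry check behind VSBC can be illustrated on a conjugate toy model. The sketch below is an illustration under stated assumptions rather than the paper's code: the model (θ ~ N(0, 1), y_i ~ N(θ, 1)), the stand-in "VI" fit (the exact Gaussian posterior with its mean shifted by `bias`), the convention that the calibration probability is Q(θ < θ0) for the prior draw θ0, and the function names are all hypothetical. The symmetry statistic is a plain Kolmogorov-Smirnov distance between the calibration probabilities and their reflections, with no p-value attached.

```python
import bisect
import math
import random


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def vsbc_probabilities(bias, n_reps=500, n_obs=10, seed=1):
    """Calibration probabilities Q(theta < theta0) over simulated data sets."""
    rng = random.Random(seed)
    probs = []
    for _ in range(n_reps):
        theta0 = rng.gauss(0.0, 1.0)  # draw the "true" value from the prior
        y = [rng.gauss(theta0, 1.0) for _ in range(n_obs)]  # simulate data
        # Exact conjugate posterior: N(sum(y) / (n + 1), 1 / (n + 1)).
        post_mean = sum(y) / (n_obs + 1)
        post_sd = 1.0 / math.sqrt(n_obs + 1)
        # Hypothetical "VI" fit: same spread, mean shifted by `bias`.
        q_mean = post_mean + bias
        probs.append(normal_cdf((theta0 - q_mean) / post_sd))
    return probs


def symmetry_statistic(probs):
    """KS distance between the samples {p} and {1 - p}; 0 means symmetric."""
    a = sorted(probs)
    b = sorted(1.0 - p for p in probs)
    n = len(a)
    # Both empirical CDFs are step functions, so the largest gap is
    # attained at one of the pooled sample points.
    return max(
        abs(bisect.bisect_right(a, x) - bisect.bisect_right(b, x)) / n
        for x in a + b
    )
```

Calling `symmetry_statistic(vsbc_probabilities(bias=0.0))` should give a value near zero, since an exact posterior produces roughly uniform probabilities. A positive `bias` pushes the probabilities toward zero under this convention and inflates the statistic, which is the kind of marginal asymmetry the diagnostic is meant to expose.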

This review was generated by AI and checked by human editors.