QUICK REVIEW

[Paper Review] Yes, but Did It Work?: Evaluating Variational Inference

Yuling Yao, Aki Vehtari|arXiv (Cornell University)|Feb 7, 2018

Computational and Text Analysis Methods | Cited by 67

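One-sentence summary

This paper proposes two diagnostics for variational inference (VI): PSIS, which assesses the quality of the joint posterior approximation and corrects the resulting estimates, and VSBC, which assesses the average calibration of VI point estimates. It also gives guidance on reparameterization and practical thresholds.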

ABSTRACT

While it's always possible to compute a variational approximation to a posterior distribution, it can be difficult to discover problems with this approximation. We propose two diagnostic algorithms to alleviate this problem. The Pareto-smoothed importance sampling (PSIS) diagnostic gives a goodness of fit measurement for joint distributions, while simultaneously improving the error in the estimate. The variational simulation-based calibration (VSBC) assesses the average performance of point estimates.

Motivation and objectives

Motivate the need to diagnose variational approximations beyond ELBO optimization.
Introduce two diagnostics to evaluate VI: PSIS for joint posterior quality and VSBC for average point-estimate calibration.
Provide practical guidance on interpretation, thresholds, and reparameterization.
Demonstrate diagnostics across representative Bayesian models and VI settings.

Proposed methods

Propose PSIS (Pareto-smoothed importance sampling) to diagnose the quality of the VI approximation by inspecting the Pareto tail shape parameter k and using it to adjust estimates.
Use PSIS as a diagnostic that also yields a stabilized estimator of posterior expectations via a weighted sum of draws (equation (3) in the paper).
Introduce VSBC (variational simulation-based calibration) to assess the average calibration of VI-derived point estimates by repeatedly drawing parameters from the prior, simulating data from the model, refitting VI, and evaluating calibration probabilities.
Discuss invariance of k under reparameterization and its relation to Rényi divergence between p and q.
Examine when reparameterization improves VI via PSIS diagnostics and provide practical examples with common VI setups (e.g., ADVI, hierarchical models, logistic/linear regression).
Outline limitations of each diagnostic and their complementary nature.
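
The role of the tail shape k in the PSIS diagnostic can be illustrated with a minimal Python sketch. It is not the paper's procedure: PSIS fits a generalized Pareto distribution to the largest importance ratios, whereas this sketch swaps in the simpler Hill estimator as a rough stand-in. The toy target p = N(0, 1), the Gaussian approximations q = N(0, sigma^2), and all function names are assumptions made for illustration; the tail size min(S/5, 3*sqrt(S)) follows the rule commonly used with PSIS.

```python
import math
import random


def log_ratio(x, sigma):
    """log p(x) - log q(x) up to a constant, with p = N(0, 1), q = N(0, sigma^2)."""
    log_p = -0.5 * x * x
    log_q = -0.5 * (x / sigma) ** 2 - math.log(sigma)
    return log_p - log_q


def hill_k(log_r, tail_size):
    """Hill estimate of the tail shape from the `tail_size` largest log ratios.

    Assumption: used here as a crude substitute for the generalized Pareto fit.
    """
    ordered = sorted(log_r, reverse=True)
    threshold = ordered[tail_size]  # the (tail_size + 1)-th largest log ratio
    return sum(v - threshold for v in ordered[:tail_size]) / tail_size


def diagnose(sigma, n_draws=4000, seed=0):
    """Draw from q, form log importance ratios, and estimate their tail shape."""
    rng = random.Random(seed)
    draws = [rng.gauss(0.0, sigma) for _ in range(n_draws)]
    log_r = [log_ratio(x, sigma) for x in draws]
    tail_size = int(min(n_draws / 5, 3 * math.sqrt(n_draws)))
    return hill_k(log_r, tail_size)


k_narrow = diagnose(sigma=0.4)  # q much narrower than p: heavy-tailed ratios
k_wide = diagnose(sigma=1.2)    # q slightly wider than p: bounded ratios
print(f"narrow q: k ~ {k_narrow:.2f}, wide q: k ~ {k_wide:.2f}")
```

Under these assumptions the too-narrow q produces a heavy right tail of importance ratios and a large estimate, while the slightly wider q has bounded ratios and an estimate near zero. This mirrors the tendency of VI to underestimate posterior spread, which is the failure mode the k diagnostic is meant to flag.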

Experimental results

Research questions

RQ1: Can PSIS diagnostics quantify the discrepancy between the VI posterior q(θ) and the true posterior p(θ|y) for a given data set?
RQ2: Can VSBC diagnostics assess whether the VI-derived point estimates are calibrated on average across data generated from the model?
RQ3: How do reparameterization and model structure affect the reliability of VI as diagnosed by PSIS and VSBC?
RQ4: What practical thresholds on the Pareto k and VSBC outcomes indicate reliable VI versus a need for tuning or MCMC?
RQ5: How do these diagnostics perform across linear, logistic, hierarchical, and high-dimensional models?

Key findings

PSIS provides a diagnostic shape parameter k that quantifies VI quality: k < 0.5 suggests fast PSIS convergence and q close to p; 0.5 < k < 0.7 indicates an imperfect but still practically useful approximation; k > 0.7 warns that VI is unreliable and calls for tuning or MCMC.
PSIS-adjusted estimates (via weighted averages with smoothed weights) can reduce bias and variance compared to plain VI or plain IS, improving finite-sample performance.
VSBC assesses average calibration of VI point estimates by testing symmetry of marginal calibration probabilities; findings can reveal bias in individual margins even when point estimates appear reasonable on average.
Reparameterization can substantially alter VI quality, and PSIS can guide the choice of parameterization to reduce k and improve fit (e.g., the non-centered parameterization in the eight-schools example).
VSBC distinguishes average unbiasedness from dataset-specific performance, highlighting that good performance on average does not guarantee accuracy for a particular realization, and vice versa.
Applications demonstrate PSIS and VSBC across linear and logistic regression, hierarchical models, and high-dimensional cancer classification with horseshoe priors, showing where VI succeeds or fails and how diagnostics inform adjustments.
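
The VSBC idea summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the model is a conjugate normal (prior θ ~ N(0, 1), observations y ~ N(θ, 1)) so the exact posterior is available in closed form, the "VI fits" are hypothetical Gaussian distortions of that posterior rather than output from a real VI algorithm, and the mean of the calibration probabilities minus 0.5 is a crude stand-in for a formal symmetry test.

```python
import math
import random


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def vsbc_probs(approx, n_reps=500, n_obs=10, seed=1):
    """Return p_j = Pr_q(theta0 < theta*) for each simulated data set."""
    rng = random.Random(seed)
    probs = []
    for _ in range(n_reps):
        theta0 = rng.gauss(0.0, 1.0)                        # parameter from the prior
        y = [rng.gauss(theta0, 1.0) for _ in range(n_obs)]  # data from the model
        post_mean = sum(y) / (n_obs + 1)                    # exact conjugate posterior
        post_sd = math.sqrt(1.0 / (n_obs + 1))
        q_mean, q_sd = approx(post_mean, post_sd)           # stand-in for a VI fit
        probs.append(1.0 - normal_cdf((theta0 - q_mean) / q_sd))
    return probs


def asymmetry(probs):
    """Crude asymmetry measure: distance of the mean probability from 0.5."""
    return sum(probs) / len(probs) - 0.5


def too_narrow(mean, sd):
    """Hypothetical approximation: correct centre, half the posterior spread."""
    return mean, 0.5 * sd


def shifted(mean, sd):
    """Hypothetical approximation: centre biased upward by one posterior sd."""
    return mean + sd, sd


asym_narrow = asymmetry(vsbc_probs(too_narrow))
asym_shifted = asymmetry(vsbc_probs(shifted))
print(f"too narrow: {asym_narrow:+.3f}, shifted: {asym_shifted:+.3f}")
```

In this toy setting the too-narrow approximation keeps the probabilities symmetric around 0.5, so its point estimate looks unbiased on average even though its uncertainty is wrong, while the shifted approximation produces a clearly asymmetric distribution. That contrast is why the summary describes VSBC as a check on point estimates and PSIS as the complementary check on the joint approximation.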


This review was generated by AI and reviewed by a human editor.