QUICK REVIEW

[Paper Review] Black-Box Alpha Divergence Minimization

José Miguel Hernández-Lobato, Yingzhen Li|arXiv (Cornell University)|Nov 10, 2015

Gaussian Processes and Bayesian Inference | 25 references | 41 citations

TL;DR

This paper introduces Black-Box Alpha (BB-α), a scalable approximate inference method that minimizes α-divergence using stochastic gradient descent. By leveraging automatic differentiation and Monte Carlo approximation, BB-α enables black-box application to complex models, outperforming standard variational Bayes (α→0) and expectation propagation (α=1) on neural network and regression tasks, especially at α=0.5.

ABSTRACT

Black-box alpha (BB-$\alpha$) is a new approximate inference method based on the minimization of $\alpha$-divergences. BB-$\alpha$ scales to large datasets because it can be implemented using stochastic gradient descent. BB-$\alpha$ can be applied to complex probabilistic models with little effort since it only requires as input the likelihood function and its gradients. These gradients can be easily obtained using automatic differentiation. By changing the divergence parameter $\alpha$, the method is able to interpolate between variational Bayes (VB) ($\alpha \rightarrow 0$) and an algorithm similar to expectation propagation (EP) ($\alpha = 1$). Experiments on probit regression and neural network regression and classification problems show that BB-$\alpha$ with non-standard settings of $\alpha$, such as $\alpha = 0.5$, usually produces better predictions than with $\alpha \rightarrow 0$ (VB) or $\alpha = 1$ (EP).
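
To make the interpolation described in the abstract concrete, here is a minimal sketch that evaluates an α-divergence between two small discrete distributions. It assumes the Amari form D_α[p‖q] = (1 − Σ p^α q^(1−α)) / (α(1−α)); the function names and the toy distributions are illustrative and do not come from the paper. As α→0 the value approaches KL(q‖p), the divergence minimized by VB, and as α→1 it approaches KL(p‖q), the divergence EP targets locally.

```python
import math

def alpha_divergence(p, q, alpha):
    """Amari alpha-divergence D_alpha[p || q] for discrete distributions.

    Assumes p and q are strictly positive, each sums to 1,
    and alpha is not exactly 0 or 1 (those cases are limits).
    """
    s = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return (1.0 - s) / (alpha * (1.0 - alpha))

def kl(a, b):
    """Kullback-Leibler divergence KL(a || b) for discrete distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

# Toy distributions (hypothetical): p plays the "true posterior", q the approximation.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

for alpha in (1e-6, 0.5, 1.0 - 1e-6):
    print(f"alpha={alpha:.6f}  D_alpha={alpha_divergence(p, q, alpha):.6f}")
print(f"KL(q || p)={kl(q, p):.6f}  KL(p || q)={kl(p, q):.6f}")
```

Intermediate values such as α=0.5 sit between the two KL limits, which is the regime the abstract reports as usually giving the best predictions.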

Motivation & Objective

To develop a scalable, black-box inference method that avoids the memory and convergence issues of traditional EP.
To enable application of power EP (via α-divergence minimization) to large-scale and complex probabilistic models without analytic energy forms.
To provide a unified framework that interpolates between variational Bayes (α→0) and EP (α=1), with improved predictive performance.
To ensure convergence and scalability through a differentiable energy function and stochastic gradient descent.
To empirically validate that non-standard α values (e.g., α=0.5) yield better predictions than α→0 or α=1.

Proposed method

BB-α minimizes the α-divergence between a tractable approximation q and the true posterior p(θ|D), using a parameterized energy function derived from power EP.
The method uses Monte Carlo approximation to estimate the intractable expectation in the α-divergence objective, enabling black-box use.
Gradients of the objective are computed via automatic differentiation, allowing end-to-end optimization with stochastic gradient descent.
The algorithm is designed to be memory-efficient by avoiding per-factor storage, unlike standard EP.
It supports α ∈ (0, 1], with α→0 recovering variational Bayes and α=1 recovering EP-like behavior.
The energy function is differentiable with respect to the parameters of q, so it can be minimized directly with gradient-based optimizers instead of a double-loop procedure.
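
The Monte Carlo step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the per-datapoint term of the energy has the shape −(1/α) log E_q[(ℓ_n(θ)/f(θ))^α], where ℓ_n is a likelihood factor and f is a site factor of the approximation, and it replaces the expectation with an average over K samples from q. The toy factors, the sample count, and all names are hypothetical.

```python
import math
import random

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def local_energy_term(log_ratios, alpha):
    """Monte Carlo estimate of -(1/alpha) * log E_q[(l_n / f)^alpha].

    log_ratios holds log l_n(theta_k) - log f(theta_k) for K samples theta_k ~ q.
    """
    return -log_mean_exp([alpha * r for r in log_ratios]) / alpha

# Toy setup (all hypothetical): q = N(0, 1), an unnormalized Gaussian
# likelihood factor centered at y = 1, and a simple quadratic site factor.
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(10)]   # K = 10 draws from q
log_ratios = [-0.5 * (1.0 - t) ** 2 + 0.1 * t * t for t in samples]

vb_like = -sum(log_ratios) / len(log_ratios)          # the alpha -> 0 limit
for alpha in (1e-8, 0.5, 1.0):
    print(f"alpha={alpha:g}  term={local_energy_term(log_ratios, alpha):.6f}")
print(f"negative mean log-ratio (alpha -> 0 limit) = {vb_like:.6f}")
```

In a real model the samples would be reparameterized draws from q and the whole expression would be passed to an automatic differentiation library; plain Python is used here only to keep the sketch dependency-free.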

Experimental results

Research questions

RQ1: Can α-divergence minimization be made scalable and black-box for complex models with intractable energy functions?
RQ2: Does BB-α outperform standard variational Bayes (α→0) and expectation propagation (α=1) in predictive accuracy?
RQ3: How does the choice of α affect predictive performance across different models and datasets?
RQ4: What is the trade-off between gradient bias and variance in the Monte Carlo approximation of the objective?
RQ5: Can BB-α be efficiently optimized using stochastic gradient descent without double-loop procedures?

Key findings

BB-α with α=0.5 usually outperforms both variational Bayes (α→0) and EP (α=1) in predictive performance on probit regression and neural network tasks.
For α=0.5, the average test RMSE on the Boston housing dataset was significantly lower than for α=1.0 or α=10⁻⁶.
Gradient bias in BB-α decreases rapidly with increasing Monte Carlo samples K, dropping to near-zero levels at K=10.
The standard deviation of the gradient estimate remains high (≈12–14) and is several orders of magnitude larger than the bias, so the bias is negligible in practice.
At K=10, the bias for α=0.5 was only 0.0013, and for α=1.0 it was 0.0077, showing low sensitivity to α choice in gradient estimation.
BB-α achieves state-of-the-art predictive performance on both small and large datasets, demonstrating scalability and robustness.
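
The bias discussed in the findings comes from taking the logarithm of a Monte Carlo average, which is a biased estimate of the logarithm of the expectation. The sketch below illustrates that effect on a toy problem rather than reproducing the paper's gradient-bias experiment: it assumes log-ratios drawn from N(0, σ²), for which log E[exp(α r)] = α²σ²/2 is known in closed form, and measures how far the K-sample estimate is from that value on average. The function names, the Gaussian assumption, and the repetition count are all illustrative.

```python
import math
import random

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def estimate_bias(K, alpha=0.5, sigma=1.0, reps=5000, seed=0):
    """Average error of log-mean-exp over K samples, relative to the exact value.

    Assumes r ~ N(0, sigma^2), so log E[exp(alpha * r)] = (alpha * sigma)^2 / 2.
    """
    rng = random.Random(seed)
    true_value = 0.5 * (alpha * sigma) ** 2
    total = 0.0
    for _ in range(reps):
        draws = [alpha * rng.gauss(0.0, sigma) for _ in range(K)]
        total += log_mean_exp(draws)
    return total / reps - true_value

for K in (1, 10):
    print(f"K={K:3d}  estimated bias={estimate_bias(K):+.4f}")
```

In this toy setting the estimate is biased downward and the bias shrinks as K grows, which matches the qualitative trend reported above; the specific numbers in the findings come from the paper's own models and are not reproduced here.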

This review was created by AI and reviewed by human editors.