QUICK REVIEW

[Paper Review] Variational Gaussian Dropout is not Bayesian

Jiri Hron, Alexander Matthews|arXiv (Cornell University)|Nov 8, 2017

Gaussian Processes and Bayesian Inference | References: 10 | Cited by: 29

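One-Sentence Summary

This paper shows that although Variational Gaussian Dropout is framed as Bayesian inference, it is not valid Bayesian learning: the improper log-uniform prior it relies on yields an improper posterior. The authors derive an exact analytical expression for the pseudo-KL divergence, show that minimising it favours high-variance posteriors, and show that the additive parametrisation introduced in later work creates spurious minima that are absent from the original multiplicative form.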

ABSTRACT

Gaussian multiplicative noise is commonly used as a stochastic regularisation technique in training of deterministic neural networks. A recent paper reinterpreted the technique as a specific algorithm for approximate inference in Bayesian neural networks; several extensions ensued. We show that the log-uniform prior used in all the above publications does not generally induce a proper posterior, and thus Bayesian inference in such models is ill-posed. Independent of the log-uniform prior, the correlated weight noise approximation has further issues leading to either infinite objective or high risk of overfitting. The above implies that the reported sparsity of obtained solutions cannot be explained by Bayesian or the related minimum description length arguments. We thus study the objective from a non-Bayesian perspective, provide its previously unknown analytical form which allows exact gradient evaluation, and show that the later proposed additive reparametrisation introduces minima not present in the original multiplicative parametrisation. Implications and future research directions are discussed.

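Motivation and Objectives

Challenge the Bayesian interpretation of variational Gaussian dropout given in prior work.
Show that the log-uniform prior used in variational dropout leads to an improper posterior, which makes Bayesian inference ill-posed.
Provide an exact analytical expression for the pseudo-KL divergence between the variational posterior and the improper prior.
Show that the reparametrisation introduced in [10] changes the optimisation landscape relative to the original formulation in [6].
Reinterpret variational dropout as a non-Bayesian, regularised maximum likelihood procedure with interpretable optimisation dynamics.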

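Proposed Method

Derive an exact analytical expression for the pseudo-KL divergence between the Gaussian variational posterior and the log-uniform prior, using the digamma function and Kummer's function.
Introduce a continuous, differentiable expression for the pseudo-KL gradient based on the Dawson integral, enabling exact gradient evaluation.
Analyse the objective under the multiplicative parametrisation (θ, α) and the additive parametrisation (μ, σ²), showing that the two are not equivalent.
Prove that the pseudo-KL is strictly increasing in u = μ²/(2σ²), so minimising it favours σ² → ∞ or μ = 0.
Show that the correlated weight noise approximation yields a degenerate approximate posterior and hence an infinite KL divergence, which undermines the Bayesian interpretation; the abstract notes that this issue is independent of the log-uniform prior.
Reinterpret ELBO optimisation as penalised maximum likelihood estimation under a non-standard measure rather than as Bayesian inference.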
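
The monotonicity claim above can be checked numerically. The sketch below is not the paper's code. It assumes the improper prior density is proportional to 1/|w| and uses the standard identity for the expected log of a noncentral chi-square variable, which gives E[log|w|] = 0.5·log(2σ²) + 0.5·E[ψ(1/2 + K)] with K ~ Poisson(u) and u = μ²/(2σ²). Additive constants of the pseudo-KL are dropped, so only its u-dependent part is computed. Differentiating the series term by term gives e^(-u)·Σ u^k / (k!·(2k+1)), which equals D(√u)/√u for the Dawson integral D; the paper's own gradient expression may be written differently. All function names are hypothetical, and the fixed truncation is only adequate for moderate u.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329
PSI_HALF = -EULER_GAMMA - 2.0 * math.log(2.0)  # digamma(1/2)


def pseudo_kl_shape(u: float, terms: int = 400) -> float:
    """u-dependent part of the pseudo-KL: 0.5 * E[psi(1/2 + K)], K ~ Poisson(u)."""
    weight = math.exp(-u)  # Poisson(u) probability of k = 0
    psi = PSI_HALF
    total = 0.0
    for k in range(terms):
        total += weight * psi
        psi += 2.0 / (2 * k + 1)   # psi(x + 1) = psi(x) + 1/x at x = k + 1/2
        weight *= u / (k + 1)      # next Poisson probability
    return 0.5 * total


def pseudo_kl_grad_u(u: float, terms: int = 400) -> float:
    """Derivative of pseudo_kl_shape in u: exp(-u) * sum_k u^k / (k! (2k + 1))."""
    weight = math.exp(-u)
    total = 0.0
    for k in range(terms):
        total += weight / (2 * k + 1)
        weight *= u / (k + 1)
    return total


def expected_log_abs_mc(mu: float, sigma: float, n: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[log|w|] for w ~ N(mu, sigma^2), as a cross-check."""
    rng = random.Random(seed)
    return sum(math.log(abs(rng.gauss(mu, sigma))) for _ in range(n)) / n


if __name__ == "__main__":
    for u in (0.0, 0.1, 1.0, 5.0, 20.0):
        print(f"u={u:5.1f}  shape={pseudo_kl_shape(u):+.6f}  grad={pseudo_kl_grad_u(u):.6f}")
```

Every term of the gradient series is positive, so the computed values rise with u. Lowering u means either shrinking μ towards zero or inflating σ², which is the behaviour the list above describes.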

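Experimental Results

Research Questions

RQ1: Does the log-uniform prior used in variational Gaussian dropout yield a proper posterior in Bayesian neural networks?
RQ2: Given the improper prior, can the variational dropout objective be meaningfully interpreted as approximate Bayesian inference?
RQ3: How does the choice of parametrisation (multiplicative vs. additive) affect the optimisation landscape and the sparsity of the final model?
RQ4: What is the exact analytical form of the KL divergence between the Gaussian posterior and the log-uniform prior?
RQ5: In this setting, does optimising the ELBO correspond to a well-defined statistical estimation procedure?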

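Key Findings

Under standard neural network likelihoods, the log-uniform prior yields an improper posterior, making Bayesian inference ill-posed.
The posterior's normalising constant is infinite, which can be shown by integrating near w = 0 and over the tails.
The correlated weight noise approximation leads to an infinite KL divergence, which undermines the Bayesian interpretation; the abstract notes that this issue is independent of the log-uniform prior.
An exact analytical expression for the pseudo-KL divergence is derived using the digamma function and Kummer's function, enabling exact gradient evaluation.
The pseudo-KL term is strictly increasing in u = μ²/(2σ²), so minimising it favours high posterior variance or zero mean.
The additive parametrisation introduced in [10] creates new minima that are absent from the original multiplicative formulation, which accounts for the reported differences in sparsity.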
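
The divergence of the normalising constant can be illustrated on a toy problem. The sketch below is an assumption-laden example, not the paper's argument: it takes a single weight w > 0, one observation with input x > 0 and label 1, a logistic likelihood sigmoid(x·w), and a prior density 1/w. Substituting t = log w turns the prior into a flat density in t, so the unnormalised posterior mass on [lo, hi] is the integral of sigmoid(x·e^t) over t. The function names are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def unnormalised_mass(lo: float, hi: float, x: float = 1.0, steps: int = 20_000) -> float:
    """Integral of sigmoid(x * w) / w over w in [lo, hi], trapezoid rule in t = log w."""
    a, b = math.log(lo), math.log(hi)
    h = (b - a) / steps
    total = 0.5 * (sigmoid(x * math.exp(a)) + sigmoid(x * math.exp(b)))
    for i in range(1, steps):
        total += sigmoid(x * math.exp(a + i * h))
    return total * h


if __name__ == "__main__":
    # Mass near zero: keeps growing as the lower limit shrinks.
    for eps in (1e-2, 1e-4, 1e-8, 1e-16):
        print(f"[{eps:.0e}, 1]   mass = {unnormalised_mass(eps, 1.0):.4f}")
    # Mass in the tail: keeps growing as the upper limit increases.
    for big in (1e2, 1e4, 1e8, 1e16):
        print(f"[1, {big:.0e}]   mass = {unnormalised_mass(1.0, big):.4f}")
```

In this toy setting the likelihood tends to 0.5 as w → 0 and to 1 as w → ∞, so the mass grows roughly like 0.5·log(1/lo) near zero and like log(hi) in the tail. Neither limit is finite, so no normalising constant exists.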

This review was generated by AI and reviewed by a human editor.