QUICK REVIEW

[Paper Digest] Stochastic Gradient Descent as Approximate Bayesian Inference

Stephan Mandt, Matthew D. Hoffman, David M. Blei|arXiv (Cornell University)|Apr 13, 2017

Stochastic Gradient Optimization Techniques|References: 39|Cited by: 108

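One-sentence summary

This paper recasts constant-step-size SGD as a stochastic process whose stationary distribution can approximate a Bayesian posterior, derives the SGD hyperparameters that are optimal for this purpose, and extends the view to momentum, preconditioning, and SGD-based MCMC variants.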

ABSTRACT

Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a Markov chain with a stationary distribution. With this perspective, we derive several new results. (1) We show that constant SGD can be used as an approximate Bayesian posterior inference algorithm. Specifically, we show how to adjust the tuning parameters of constant SGD to best match the stationary distribution to a posterior, minimizing the Kullback-Leibler divergence between these two distributions. (2) We demonstrate that constant SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models. (3) We also propose SGD with momentum for sampling and show how to adjust the damping coefficient accordingly. (4) We analyze MCMC algorithms. For Langevin Dynamics and Stochastic Gradient Fisher Scoring, we quantify the approximation errors due to finite learning rates. Finally (5), we use the stochastic process perspective to give a short proof of why Polyak averaging is optimal. Based on this idea, we propose a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler.

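Motivation and objectives

Give constant SGD a probabilistic interpretation as an approximate posterior sampler over model parameters.
Derive the SGD hyperparameters (learning rate and preconditioner) that minimize the KL divergence from the stationary distribution to the posterior.
Show how momentum and preconditioning change the stationary distribution used for approximate inference.
Develop a variational EM algorithm and a scalable MCMC perspective within the Ornstein-Uhlenbeck (OU) process framework.
Analyze the implications for iterate averaging and for stochastic gradient MCMC algorithms.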

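Proposed method

Model SGD with a constant learning rate as a multivariate Ornstein-Uhlenbeck (OU) process around a local optimum.
Assume Gaussian gradient noise and a locally quadratic loss, which yields an analytic (Gaussian) stationary distribution.
Minimize the KL divergence between the stationary distribution and the Gaussian posterior to derive the optimal SGD settings.
Extend the analysis to full preconditioning matrices and diagonal variants for a closer posterior match.
Treat SGD with momentum within the same OU framework, where momentum rescales the stationary covariance, and use it for approximate inference.
Compare the SGD-based posterior with black-box variational inference (BBVI), and analyze hyperparameter optimization through a variational EM view.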
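
The modeling steps above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a quadratic loss 0.5 * theta^T A theta and Gaussian gradient noise with single-example covariance BB^T (so a minibatch of size S has noise covariance BB^T / S), runs plain constant-step SGD, and compares the empirical covariance of the iterates with the OU prediction Sigma that solves A Sigma + Sigma A = (eps / S) BB^T. The function names and the example matrices are illustrative assumptions.

```python
import numpy as np


def ou_stationary_cov(A, B, eps, S):
    """Solve A @ Sigma + Sigma @ A = (eps / S) * B @ B.T for Sigma.

    Assumes A is symmetric positive definite (locally quadratic loss).
    """
    D = A.shape[0]
    I = np.eye(D)
    # Row-major vectorization: vec(A Sigma) = (A kron I) vec(Sigma) and
    # vec(Sigma A) = (I kron A) vec(Sigma) for symmetric A.
    K = np.kron(A, I) + np.kron(I, A)
    rhs = (eps / S) * (B @ B.T)
    return np.linalg.solve(K, rhs.reshape(-1)).reshape(D, D)


def simulate_constant_sgd(A, B, eps, S, n_steps, burn_in=1000, seed=0):
    """Run constant-step SGD on 0.5 * theta^T A theta with Gaussian gradient noise."""
    rng = np.random.default_rng(seed)
    D = A.shape[0]
    total = burn_in + n_steps
    # Minibatch gradient noise: (1 / sqrt(S)) * B @ xi with xi ~ N(0, I).
    noise = rng.standard_normal((total, D)) @ B.T / np.sqrt(S)
    theta = np.zeros(D)
    samples = np.empty((n_steps, D))
    for t in range(total):
        grad = A @ theta + noise[t]      # noisy gradient of the quadratic loss
        theta = theta - eps * grad       # constant learning rate, no decay
        if t >= burn_in:
            samples[t - burn_in] = theta
    return samples


if __name__ == "__main__":
    A = np.array([[2.0, 0.5], [0.5, 1.0]])   # hypothetical Hessian at the optimum
    B = np.array([[1.0, 0.0], [0.3, 0.8]])   # hypothetical noise factor
    eps, S = 0.01, 10
    samples = simulate_constant_sgd(A, B, eps, S, n_steps=200_000)
    print("empirical covariance:\n", np.cov(samples.T))
    print("OU prediction:\n", ou_stationary_cov(A, B, eps, S))
```

The two matrices should agree only up to sampling error and a discretization error of order eps, since the OU process is a continuous-time approximation of the discrete updates.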

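Experimental results

Research questions

RQ1: Can constant SGD be tuned to produce an approximate Bayesian posterior over the parameters?
RQ2: How should the learning rate and preconditioner be chosen to minimize the KL divergence to the posterior?
RQ3: How does momentum affect the stationary distribution, and what does that imply for approximate sampling?
RQ4: How do stochastic gradient MCMC methods (SGLD, SGFS) relate to SGD within the OU process framework, and what are their approximation errors?
RQ5: Within this framework, does iterate averaging yield optimal sampling properties?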

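Key findings

Under the paper's assumptions, the stationary distribution of constant SGD is Gaussian and can approximate the posterior; the KL divergence between the two guides the choice of hyperparameters.
Theorem 1 gives the KL-optimal scalar learning rate: ε* = (2S/N) · D / Tr(BB^T), where S is the minibatch size, N the number of data points, D the parameter dimension, and BB^T the covariance of the single-example gradient noise.
Theorem 2 gives the optimal full preconditioner, H* = (2S/(εN)) (BB^T)^{-1}, so that the effective step εH* equals (2S/N)(BB^T)^{-1}; with this choice the stationary distribution matches the Gaussian posterior. The optimal diagonal preconditioner is also characterized.
Momentum rescales the stationary covariance while preserving its shape, so SGD with momentum can also be used for approximate sampling.
For SG-MCMC methods, the OU process view justifies the optimality of preconditioning and quantifies the error caused by finite learning rates. Iterate averaging can yield a near-optimal sampler, but at a cost that is linear in the number of passes over the data.
Under certain assumptions, iterate averaging yields exactly one effectively independent sample per pass over the data.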
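
Theorems 1 and 2 can be sanity-checked numerically under the same assumptions (quadratic loss with Hessian A, Gaussian posterior with covariance A^{-1} / N, single-example gradient noise covariance BB^T). The sketch below is illustrative rather than the authors' implementation; the helper names and the example values of A, B, S, and N are assumptions. It solves the stationary covariance equation of the preconditioned OU process, (HA) Sigma + Sigma (HA)^T = (eps / S) H BB^T H^T, and evaluates the KL divergence from the stationary distribution to the posterior.

```python
import numpy as np


def stationary_cov(A, B, H, eps, S):
    """Stationary covariance of the OU approximation to preconditioned SGD.

    Solves (H A) Sigma + Sigma (H A)^T = (eps / S) * H B B^T H^T.
    """
    D = A.shape[0]
    I = np.eye(D)
    M = H @ A
    # Row-major vectorization: vec(M Sigma) = (M kron I) vec(Sigma),
    # vec(Sigma M^T) = (I kron M) vec(Sigma).
    K = np.kron(M, I) + np.kron(I, M)
    rhs = (eps / S) * (H @ B @ B.T @ H.T)
    return np.linalg.solve(K, rhs.reshape(-1)).reshape(D, D)


def kl_to_posterior(Sigma, A, N):
    """KL(q || f) for q = N(0, Sigma) and posterior f = N(0, A^{-1} / N)."""
    D = A.shape[0]
    P = N * A @ Sigma                      # posterior precision times Sigma
    _, logdet = np.linalg.slogdet(P)
    return 0.5 * (np.trace(P) - D - logdet)


def optimal_learning_rate(B, S, N):
    """Theorem 1: eps* = (2 S / N) * D / Tr(B B^T)."""
    D = B.shape[0]
    return 2.0 * S * D / (N * np.trace(B @ B.T))


def optimal_preconditioner(B, S, N, eps):
    """Theorem 2: H* = (2 S / (eps N)) * (B B^T)^{-1}."""
    return 2.0 * S / (eps * N) * np.linalg.inv(B @ B.T)


if __name__ == "__main__":
    A = np.array([[2.0, 0.5], [0.5, 1.0]])   # hypothetical Hessian
    B = np.array([[1.0, 0.0], [0.3, 0.8]])   # hypothetical noise factor
    S, N = 10, 1000
    I = np.eye(2)

    eps_star = optimal_learning_rate(B, S, N)
    for eps in (0.5 * eps_star, eps_star, 2.0 * eps_star):
        kl = kl_to_posterior(stationary_cov(A, B, I, eps, S), A, N)
        print(f"scalar eps = {eps:.5f}  KL = {kl:.5f}")

    H_star = optimal_preconditioner(B, S, N, eps_star)
    kl = kl_to_posterior(stationary_cov(A, B, H_star, eps_star, S), A, N)
    print(f"full preconditioner  KL = {kl:.2e}")
```

If the assumptions hold, the KL divergence is smallest at eps* among scalar learning rates, and it vanishes up to round-off with the full preconditioner, because the stationary covariance then equals A^{-1} / N.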

This digest was generated by AI and reviewed by a human editor.