QUICK REVIEW

[Paper Digest] Estimating or Propagating Gradients Through Stochastic Neurons

Yoshua Bengio|arXiv (Cornell University)|May 14, 2013

Topic: Adversarial Robustness in Machine Learning | References: 15 | Cited by: 81

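One-Sentence Summary

This paper proposes two new families of gradient estimators for stochastic neurons in deep learning, making it possible to back-propagate through non-differentiable binary stochastic units. The first family uses an unbiased, correlation-based estimator that treats a stochastic neuron's output as a perturbation correlated with the loss, works even where back-propagation is impossible, and can be approximated with lower variance by learning to predict it from a biased estimator; the second family assumes a gradient estimate can be back-propagated and applies to non-linearities that, like the rectifier, are not flat over their whole range.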

ABSTRACT

Stochastic neurons can be useful for a number of reasons in deep learning models, but in many cases they pose a challenging problem: how to estimate the gradient of a loss function with respect to the input of such stochastic neurons, i.e., can we "back-propagate" through these stochastic neurons? We examine this question, existing approaches, and present two novel families of solutions, applicable in different settings. In particular, it is demonstrated that a simple biologically plausible formula gives rise to an unbiased (but noisy) estimator of the gradient with respect to a binary stochastic neuron firing probability. Unlike other estimators which view the noise as a small perturbation in order to estimate gradients by finite differences, this estimator is unbiased even without assuming that the stochastic perturbation is small. This estimator is also interesting because it can be applied in very general settings which do not allow gradient back-propagation, including the estimation of the gradient with respect to future rewards, as required in reinforcement learning setups. We also propose an approach to approximating this unbiased but high-variance estimator by learning to predict it using a biased estimator. The second approach we propose assumes that an estimator of the gradient can be back-propagated and it provides an unbiased estimator of the gradient, but can only work with non-linearities unlike the hard threshold, but like the rectifier, that are not flat for all of their range. This is similar to traditional sigmoidal units but has the advantage that for many inputs, a hard decision (e.g., a 0 output) can be produced, which would be convenient for conditional computation and achieving sparse representations and sparse gradients.

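Motivation and Objectives

Address the challenge of estimating gradients through stochastic neurons, especially binary stochastic neurons with non-differentiable activation functions.
Develop methods that back-propagate gradients through stochastic units without relying on smooth, differentiable non-linearities.
Provide an unbiased yet computationally cheap gradient estimator for settings where conventional back-propagation fails, such as reinforcement learning or models that make hard decisions.
Reduce the high variance of the unbiased estimator by learning a correction function that turns a biased but low-variance estimator into one with less bias while keeping its variance low.
Connect the proposed estimators to existing frameworks such as Boltzmann machines and SPSA, showing their theoretical and practical relevance.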

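Proposed Method

Propose an unbiased gradient estimator based on the correlation between a stochastic neuron's output and the loss, using the formula $ \mathbb{E}[X_i R] $, where $ X_i $ is the neuron's output and $ R $ is the reward.
Introduce a reward-correlation interpretation of the Boltzmann machine log-likelihood gradient, showing that it is an unnormalized form of the correlation-based estimator.
Propose a variance-reduction technique that trains a function to map a biased but low-variance estimator to one with less bias and comparably low variance.
Apply the approach to binary stochastic neurons and to non-linearities that are not flat over their entire range, such as rectified linear units.
Within a computational-graph framework, model stochastic units as $ X_{it} \sim \text{Bernoulli}(\sigma(a_{it})) $ and apply Theorem 1 to derive the gradient estimator.
Show that the estimator $ X_i^+ - X_i^- $ for biases and the estimator $ X_i^+X_j^+ - X_i^-X_j^- $ for weights correspond exactly to the Boltzmann machine gradient estimators.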
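
The correlation idea listed above can be illustrated for a single binary unit. The sketch below is an illustration under stated assumptions, not the paper's code: it uses the centered form $(h - p)\,L(h)$ with $p = \sigma(a)$, a standard score-function form for which $\mathbb{E}[(h - p)\,L(h)] = p(1-p)\,(L(1) - L(0)) = \partial \mathbb{E}[L] / \partial a$. The names `correlation_gradient_estimate`, `exact_gradient`, and `toy_loss` are hypothetical, and the loss is made up for the demonstration.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def correlation_gradient_estimate(a, loss_fn, n_samples, rng):
    """Monte Carlo estimate of d E[loss(h)] / d a for h ~ Bernoulli(sigmoid(a)).

    Assumption: centered score-function form (h - p) * loss(h). No derivative
    of loss_fn is ever taken, so the loss may be non-differentiable.
    """
    p = sigmoid(a)
    total = 0.0
    for _ in range(n_samples):
        h = 1.0 if rng.random() < p else 0.0  # sample the binary unit
        total += (h - p) * loss_fn(h)         # correlate centered output with loss
    return total / n_samples


def exact_gradient(a, loss_fn):
    """Closed form: E[loss] = p*loss(1) + (1-p)*loss(0), and dp/da = p*(1-p)."""
    p = sigmoid(a)
    return p * (1.0 - p) * (loss_fn(1.0) - loss_fn(0.0))


def toy_loss(h):
    # Hypothetical loss, chosen only for the demonstration.
    return (h - 0.2) ** 2 + 1.0


rng = random.Random(0)
a = 0.3
estimate = correlation_gradient_estimate(a, toy_loss, 200_000, rng)
exact = exact_gradient(a, toy_loss)
print(f"estimate={estimate:.4f} exact={exact:.4f}")
```

With enough samples the Monte Carlo average should settle near the closed-form value. The per-sample noise is large, which is the high-variance behavior that motivates the variance-reduction step.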

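Experimental Results

Research Questions

RQ1: Can the gradient of a loss function with respect to the input of a stochastic neuron be estimated without assuming small perturbations or smoothness?
RQ2: Can an unbiased gradient estimator for binary stochastic neurons be constructed without relying on finite differences or small-noise approximations?
RQ3: How can the high variance of unbiased gradient estimators in stochastic neural networks be reduced while keeping computational cost low?
RQ4: Can the Boltzmann machine gradient be interpreted as a form of correlation-based gradient estimator, and what does that imply for training stochastic networks?
RQ5: How does the proposed correlation-based estimator relate to existing methods such as SPSA or reinforcement-learning policy gradients?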

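Key Findings

An unbiased gradient estimator for binary stochastic neurons is derived from the correlation between the neuron's output and the loss; it holds even without assuming that the perturbation is small.
The proposed estimator is computationally cheaper than standard back-propagation because it avoids the backward pass.
The Boltzmann machine log-likelihood gradient is shown to be equivalent to an unnormalized form of the correlation-based estimator, giving a new interpretation of that learning rule.
A variance-reduction technique is proposed that learns a correction function to turn a biased but low-variance estimator into one with less bias at a similar variance level.
The approach applies where conventional back-propagation fails, such as hard-threshold units or reinforcement learning with estimates of future reward.
Theoretical analysis shows that the correlation-based estimator differs fundamentally from SPSA: it multiplies the perturbation by the reward instead of dividing the change in reward by the perturbation.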
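
To make the SPSA contrast concrete, here is a minimal sketch of SPSA in its commonly stated textbook form, where the change in loss is divided by the perturbation. This form is an assumption brought in for illustration; the digest names SPSA but does not give a formula. The function names, the quadratic test loss, and the step size `c` are all hypothetical.

```python
import random


def spsa_gradient(loss_fn, theta, c, rng):
    """One SPSA gradient sample (standard textbook form, assumed here).

    All coordinates are perturbed at once by +/-c, and the resulting change
    in loss is DIVIDED by each coordinate's perturbation.
    """
    delta = [rng.choice((-1.0, 1.0)) for _ in theta]   # random sign per coordinate
    plus = [t + c * d for t, d in zip(theta, delta)]
    minus = [t - c * d for t, d in zip(theta, delta)]
    diff = loss_fn(plus) - loss_fn(minus)              # change in loss
    return [diff / (2.0 * c * d) for d in delta]       # divide by perturbation


def quadratic_loss(theta):
    # Hypothetical smooth loss; its true gradient is 2 * theta.
    return sum(t * t for t in theta)


rng = random.Random(0)
theta = [1.0, -2.0, 0.5]
n_samples = 20_000
avg = [0.0] * len(theta)
for _ in range(n_samples):
    g = spsa_gradient(quadratic_loss, theta, 0.01, rng)
    avg = [a + gi / n_samples for a, gi in zip(avg, g)]
print([round(v, 2) for v in avg])
```

SPSA needs two loss evaluations around a small perturbation of a smooth function. The correlation-based estimator described in the findings instead multiplies a (not necessarily small) stochastic perturbation by the observed reward.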


This digest was generated by AI and reviewed by a human editor.