QUICK REVIEW

[Paper Review] Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Yoshua Bengio, Nicholas Léonard|arXiv (Cornell University)|Aug 15, 2013

Stochastic Gradient Optimization Techniques | References: 11 | Cited by: 2,001

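ONE-SENTENCE SUMMARY

The authors compare four gradient estimation strategies for stochastic or non-smooth neurons and demonstrate their use for gating parts of a network in a conditional computation setting.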

ABSTRACT

Stochastic neurons and hard non-linearities can be useful for a number of reasons in deep learning models, but in many cases they pose a challenging problem: how to estimate the gradient of a loss function with respect to the input of such stochastic or non-smooth neurons? I.e., can we "back-propagate" through these stochastic neurons? We examine this question, existing approaches, and compare four families of solutions, applicable in different settings. One of them is the minimum variance unbiased gradient estimator for stochastic binary neurons (a special case of the REINFORCE algorithm). A second approach, introduced here, decomposes the operation of a binary stochastic neuron into a stochastic binary part and a smooth differentiable part, which approximates the expected effect of the pure stochastic binary neuron to first order. A third approach involves the injection of additive or multiplicative noise in a computational graph that is otherwise differentiable. A fourth approach heuristically copies the gradient with respect to the stochastic output directly as an estimator of the gradient with respect to the sigmoid argument (we call this the straight-through estimator). To explore a context where these estimators are useful, we consider a small-scale version of conditional computation, where sparse stochastic units form a distributed representation of gaters that can turn off in combinatorially many ways large chunks of the computation performed in the rest of the neural network. In this case, it is important that the gating units produce an actual 0 most of the time. The resulting sparsity can potentially be exploited to greatly reduce the computational cost of large deep networks for which conditional computation would be useful.

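MOTIVATION AND OBJECTIVES

Motivate gradient estimation through stochastic or non-smooth neurons as a prerequisite for conditional computation.
Review and compare four families of gradient estimators: unbiased estimators, the stochastic-times-smooth decomposition, noise injection in an otherwise differentiable graph, and the straight-through method.
Show that stochastic gating can feasibly activate only selected parts of a large network.
Evaluate the practical performance of the proposed methods on gater/expert architectures with sparsity constraints.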

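PROPOSED METHODS

Formulate a stochastic neuron as h_i = f(a_i, z_i), where z_i is a noise source, and identify where gradients can flow through it.
Introduce four approaches: (i) an unbiased gradient estimator for stochastic binary neurons (a REINFORCE-style estimator); (ii) a decomposition of the stochastic binary neuron into a stochastic binary part and a smooth part that approximates it to first order; (iii) noise injection into a graph that otherwise remains differentiable; (iv) the straight-through estimator, which propagates gradients through binary or stochastic gates.
Propose and analyze the Noisy Rectifier, STS (Stochastic Times Smooth), ST (Straight-Through), and the unbiased REINFORCE-based estimator.
Discuss how to improve the unbiased estimator by reducing its variance with a centered estimator and unit-specific baselines.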
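
The two simplest estimators in the list above can be sketched for a sigmoid-Bernoulli unit as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of NumPy, and the scalar `baseline` argument (for example, a running average of the loss) are assumptions made here for clarity.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_binary(a, rng):
    """Forward pass of a stochastic binary neuron: h ~ Bernoulli(sigmoid(a))."""
    p = sigmoid(a)
    h = (rng.random(a.shape) < p).astype(a.dtype)
    return h, p

def straight_through_grad(grad_h):
    """Straight-through: reuse the gradient w.r.t. the sampled output h
    as the estimate of the gradient w.r.t. the pre-activation a."""
    return grad_h

def reinforce_grad(h, p, loss, baseline=0.0):
    """REINFORCE-style estimate of d E[loss] / d a.
    (h - p) is the derivative of log P(h | a) w.r.t. a for this unit.
    `baseline` is a hypothetical variance-reduction term; any value that
    does not depend on h leaves the estimate unbiased."""
    return (h - p) * (loss - baseline)

# Toy usage with an assumed squared-error loss on the sampled outputs.
rng = np.random.default_rng(0)
a = np.array([-1.0, 0.0, 2.0])
target = np.array([0.0, 1.0, 1.0])
h, p = sample_binary(a, rng)
loss = np.sum((h - target) ** 2)
grad_h = 2.0 * (h - target)

g_st = straight_through_grad(grad_h)   # needs dL/dh from backprop
g_rf = reinforce_grad(h, p, loss)      # needs only the scalar loss
```

The straight-through path needs a gradient with respect to `h` from the layers above, while the REINFORCE-style path needs only the observed loss, which is why the latter still applies when nothing downstream is differentiable.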

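EXPERIMENTAL RESULTS

Research Questions

RQ1: Can we effectively back-propagate through stochastic or non-smooth neurons?
RQ2: Which gradient estimators give unbiased or low-variance updates for stochastic binary or gating units?
RQ3: Can stochastic gating deliver meaningful conditional computation and potential computational savings?
RQ4: How do these estimators perform in practice on gater/expert networks trained on MNIST?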

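Key Findings

The REINFORCE-style estimator for stochastic binary neurons is an unbiased estimate of the gradient of the expected loss.
STS units and the Noisy Rectifier show favorable properties and let gradients flow through stochastic gates.
The straight-through estimator performs surprisingly well in practice despite being biased, and usually gives the best validation and test results in the experiments.
Conditioning computation on stochastic gaters reduces the computation by about 10% of the units, with a moderate effect on performance.
All tested estimators allow training to proceed; injecting noise can improve both the training objective and generalization.
In the reported MNIST experiments, straight-through units achieved the best validation and test errors.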
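
The claims about bias can be checked exactly for a single unit, because a binary neuron has only two outcomes and the expectation is a two-term sum. The sketch below is an illustration under stated assumptions: `toy_loss` is a hypothetical loss chosen here, not one taken from the paper, and the helper names are invented for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def toy_loss(h):
    # Hypothetical smooth loss of the unit's output (an assumption).
    return (h - 0.3) ** 2 + 2.0 * h

def toy_loss_grad(h):
    # Derivative of toy_loss with respect to h.
    return 2.0 * (h - 0.3) + 2.0

def true_grad(a):
    """d/da of E[L(h)] = p*L(1) + (1-p)*L(0), with p = sigmoid(a)."""
    p = sigmoid(a)
    return (toy_loss(1.0) - toy_loss(0.0)) * p * (1.0 - p)

def expected_reinforce(a):
    """Exact expectation of (h - p) * L(h) over h in {0, 1}."""
    p = sigmoid(a)
    return p * (1.0 - p) * toy_loss(1.0) + (1.0 - p) * (0.0 - p) * toy_loss(0.0)

def expected_straight_through(a):
    """Exact expectation of dL/dh evaluated at the sampled h."""
    p = sigmoid(a)
    return p * toy_loss_grad(1.0) + (1.0 - p) * toy_loss_grad(0.0)

a = 0.5
print("true gradient:            ", true_grad(a))
print("E[REINFORCE estimate]:    ", expected_reinforce(a))
print("E[straight-through est.]: ", expected_straight_through(a))
```

For this toy loss the REINFORCE-style expectation matches the true gradient, while the straight-through expectation differs in magnitude but keeps the same sign. That pattern fits the finding that a biased estimator can still train well, though a single toy case does not establish it in general.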


This review was generated by AI and checked by a human editor.