QUICK REVIEW

[Paper Review] Estimating or Propagating Gradients Through Stochastic Neurons

Yoshua Bengio|arXiv (Cornell University)|May 14, 2013

Adversarial Robustness in Machine Learning | References: 15 | Citations: 81

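Summary in brief

This paper proposes two new families of gradient estimators for stochastic neurons in deep learning, making it possible to back-propagate through non-differentiable, binary stochastic units. The first is an unbiased, correlation-based estimator that correlates a stochastic neuron's output with the loss. The second learns a low-variance correction to a biased estimator, giving efficient gradient estimates even in settings where back-propagation is not possible.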

ABSTRACT

Stochastic neurons can be useful for a number of reasons in deep learning models, but in many cases they pose a challenging problem: how to estimate the gradient of a loss function with respect to the input of such stochastic neurons, i.e., can we "back-propagate" through these stochastic neurons? We examine this question, existing approaches, and present two novel families of solutions, applicable in different settings. In particular, it is demonstrated that a simple biologically plausible formula gives rise to an unbiased (but noisy) estimator of the gradient with respect to a binary stochastic neuron firing probability. Unlike other estimators which view the noise as a small perturbation in order to estimate gradients by finite differences, this estimator is unbiased even without assuming that the stochastic perturbation is small. This estimator is also interesting because it can be applied in very general settings which do not allow gradient back-propagation, including the estimation of the gradient with respect to future rewards, as required in reinforcement learning setups. We also propose an approach to approximating this unbiased but high-variance estimator by learning to predict it using a biased estimator. The second approach we propose assumes that an estimator of the gradient can be back-propagated and it provides an unbiased estimator of the gradient, but can only work with non-linearities unlike the hard threshold, but like the rectifier, that are not flat for all of their range. This is similar to traditional sigmoidal units but has the advantage that for many inputs, a hard decision (e.g., a 0 output) can be produced, which would be convenient for conditional computation and achieving sparse representations and sparse gradients.

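Motivation and objectives

Address the problem of estimating gradients through binary stochastic neurons whose activation functions are non-differentiable.
Develop methods that back-propagate gradients through stochastic units without relying on smooth, differentiable non-linearities.
Provide unbiased, computationally efficient gradient estimators that still apply where conventional back-propagation fails (e.g., reinforcement learning or models with hard decisions).
Reduce the high variance of the unbiased estimator by learning a correction function that turns a biased, low-variance estimator into one with less bias and low variance.
Connect the proposed estimators to existing frameworks such as Boltzmann machines and SPSA (simultaneous perturbation stochastic approximation), and show their theoretical and practical significance.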

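Proposed method

Proposes an unbiased gradient estimator based on the correlation between a stochastic neuron's output and the loss, using the expression $ \mathbb{E}[X_i R] $, where $ X_i $ is the neuron's output and $ R $ is the reward.
Presents the log-likelihood gradient of the Boltzmann machine under a reward-correlation interpretation, showing it to be an unnormalized form of the correlation-based estimator.
Develops a variance-reduction technique that learns a correction function turning a biased, low-variance estimator into one with less bias and low variance.
Applies the approach to binary stochastic neurons and also to non-linearities that are not flat over their whole range (e.g., ReLU).
Uses a computational-graph framework that models stochastic units as $ X_{it} \sim \text{Bernoulli}(\sigma(a_{it})) $ and applies Theorem 1 to derive the gradient estimators.
Shows that the estimator for biases, $ X_i^+ - X_i^- $, and the estimator for weights, $ X_i^+X_j^+ - X_i^-X_j^- $, coincide exactly with the Boltzmann machine gradient estimators.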
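As a rough illustration of the correlation idea above, the sketch below checks numerically that a product of the form "centered output times loss" is an unbiased gradient estimate for a single unit $ X \sim \text{Bernoulli}(\sigma(a)) $. The centered form `(h - p) * loss(h)`, the toy `loss` function, and all function names are assumptions made for this example; the review itself only states the correlation $ \mathbb{E}[X_i R] $. For one unit the check is elementary: the expectation of `(h - p) * loss(h)` equals `p * (1 - p) * (loss(1) - loss(0))`, which is the derivative of the expected loss with respect to `a`.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def loss(h):
    # Hypothetical loss of the binary output h; any function of h would do.
    return 3.0 * (h - 0.2) ** 2 + 1.0


def exact_gradient(a):
    # d/da E[loss(h)] with h ~ Bernoulli(sigmoid(a)):
    # E[loss] = p*loss(1) + (1-p)*loss(0) and dp/da = p*(1-p).
    p = sigmoid(a)
    return p * (1.0 - p) * (loss(1) - loss(0))


def estimate_gradient(a, n_samples, rng):
    # Monte Carlo average of (h - p) * loss(h).
    # No derivative of the loss is needed, only its sampled value.
    p = sigmoid(a)
    total = 0.0
    for _ in range(n_samples):
        h = 1 if rng.random() < p else 0
        total += (h - p) * loss(h)
    return total / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    print("exact    :", exact_gradient(0.5))
    print("estimated:", estimate_gradient(0.5, 200_000, rng))
```

The estimate only approaches the exact value after many samples, which matches the "unbiased but noisy" description in the abstract and motivates the variance-reduction step in the list above.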

Results

Research questions

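RQ1: Can the gradient of a loss function with respect to the input of a stochastic neuron be estimated without assuming small perturbations or smoothness?
RQ2: Can an unbiased gradient estimator for binary stochastic neurons be constructed without relying on finite differences or small-noise approximations?
RQ3: How can the high variance of unbiased estimators in stochastic neural networks be reduced while keeping computational cost low?
RQ4: Can the Boltzmann machine gradient be interpreted as a form of correlation-based gradient estimator, and if so, what does that imply for training stochastic networks?
RQ5: How do the proposed correlation-based estimators relate to existing methods such as SPSA and policy gradients in reinforcement learning?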

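Key findings

An unbiased gradient estimator for binary stochastic neurons can be derived from the correlation between the neuron's output and the loss, without assuming small perturbations.
Because the estimator avoids the backward pass, it costs less to compute than standard back-propagation.
The log-likelihood gradient of the Boltzmann machine is shown to equal an unnormalized form of the correlation-based estimator, which gives its learning rule a new interpretation.
A variance-reduction technique is proposed that learns a correction function turning a biased, low-variance estimator into one with less bias and low variance; the aim is to reduce bias while keeping variance low.
The approach remains applicable where conventional back-propagation fails (e.g., hard-switching units or reinforcement learning that requires estimating future rewards).
Theoretical analysis shows that the correlation-based estimator differs fundamentally from SPSA: it multiplies the perturbation by the reward, whereas SPSA divides the change in reward by the perturbation.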
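To make the contrast with SPSA concrete, here is a minimal sketch of the two algebraic forms. `spsa_gradient` follows the standard two-sided SPSA formula (a random sign vector perturbs all parameters at once, and the change in the objective is divided by each perturbation component), while `correlation_sample` multiplies a centered perturbation by the reward. The toy `objective`, the step size `c`, and every function name are assumptions made for this illustration and are not taken from the paper.

```python
import random


def objective(theta):
    # Hypothetical smooth objective, minimized at theta = (1, ..., 1).
    return sum((t - 1.0) ** 2 for t in theta)


def spsa_gradient(f, theta, c, rng):
    # One SPSA draw: DIVIDE the change in f by the perturbation.
    delta = [rng.choice((-1.0, 1.0)) for _ in theta]
    plus = [t + c * d for t, d in zip(theta, delta)]
    minus = [t - c * d for t, d in zip(theta, delta)]
    change = f(plus) - f(minus)
    return [change / (2.0 * c * d) for d in delta]


def correlation_sample(h, p, reward):
    # One correlation-style draw: MULTIPLY the perturbation (h - p)
    # of a Bernoulli(p) unit by the observed reward.
    return (h - p) * reward


def average_spsa(f, theta, c, n_draws, rng):
    # Average many SPSA draws to reduce the noise of a single draw.
    total = [0.0] * len(theta)
    for _ in range(n_draws):
        draw = spsa_gradient(f, theta, c, rng)
        total = [s + g for s, g in zip(total, draw)]
    return [s / n_draws for s in total]


if __name__ == "__main__":
    rng = random.Random(1)
    # The true gradient of the toy objective at this point is [-2, 2, 4].
    print(average_spsa(objective, [0.0, 2.0, 3.0], 0.1, 20_000, rng))
```

The SPSA form needs two evaluations of the objective at perturbed parameter values and a perturbation size `c`, whereas the product form uses a single sampled output and its reward, with no step size to tune.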


This review was generated by AI and checked by a human editor.