QUICK REVIEW

[Paper Review] Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets

Penghang Yin, Jiancheng Lyu|arXiv (Cornell University)|Mar 13, 2019

Domain Adaptation and Few-Shot Learning | Cited by 120

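One-Sentence Summary

Properly chosen straight-through estimators (STEs) yield a descent direction for the population loss of a two-layer network with binary activations, whereas the identity STE can cause instability. In experiments, the clipped ReLU STE often performs best on quantized networks.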

ABSTRACT

Training activation quantized neural networks involves minimizing a piecewise constant function whose gradient vanishes almost everywhere, which is undesirable for the standard back-propagation or chain rule. An empirical way around this issue is to use a straight-through estimator (STE) (Bengio et al., 2013) in the backward pass only, so that the "gradient" through the modified chain rule becomes non-trivial. Since this unusual "gradient" is certainly not the gradient of loss function, the following question arises: why searching in its negative direction minimizes the training loss? In this paper, we provide the theoretical justification of the concept of STE by answering this question. We consider the problem of learning a two-linear-layer network with binarized ReLU activation and Gaussian input data. We shall refer to the unusual "gradient" given by the STE-modified chain rule as coarse gradient. The choice of STE is not unique. We prove that if the STE is properly chosen, the expected coarse gradient correlates positively with the population gradient (not available for the training), and its negation is a descent direction for minimizing the population loss. We further show the associated coarse gradient descent algorithm converges to a critical point of the population loss minimization problem. Moreover, we show that a poor choice of STE leads to instability of the training algorithm near certain local minima, which is verified with CIFAR-10 experiments.
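
To make the abstract's first point concrete, here is a minimal sketch (not taken from the paper) that numerically differentiates a binarized ReLU away from the origin. The helper names `binary_activation` and `central_difference` and the probe points are hypothetical choices for illustration.

```python
def binary_activation(x: float) -> float:
    """Binarized ReLU: 1 for positive inputs, 0 otherwise (piecewise constant)."""
    return 1.0 if x > 0 else 0.0


def central_difference(fn, x: float, eps: float = 1e-4) -> float:
    """Symmetric finite-difference estimate of d fn / dx at x."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)


# Probe points chosen away from the single jump at x = 0 (illustrative values).
points = [-2.0, -0.5, 0.3, 1.7]
slopes = [central_difference(binary_activation, x) for x in points]

# Every slope is exactly zero, so standard backpropagation receives no signal.
print(slopes)
```

Because the slope is zero everywhere except at the jump, the chain rule passes nothing back to the weights, which is why a surrogate derivative is substituted in the backward pass.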

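Motivation and Objectives

Motivate the study of training activation-quantized networks and show the gradient problem caused by piecewise constant activations.
Define a tractable two-linear-layer CNN model with binary activation and Gaussian inputs that permits theoretical analysis.
Introduce the coarse gradient obtained via an STE in the backward pass and analyze how it relates to the true population gradient.
Prove that a proper choice of STE yields a descent direction and convergence to a critical point of the population loss, whereas a poor STE (identity) can cause instability.
Validate the theoretical findings on MNIST and CIFAR-10 to inform practical STE selection.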
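
The objects named in this list can be written out as follows. This is a hedged reconstruction of the setup described in the abstract, not a quotation from the paper: the dimensions m and n, the teacher parameters v^* and w^*, and the factor 1/2 in the loss are assumptions made for illustration.

```latex
% Assumed notation: Z \in \mathbb{R}^{m \times n} has i.i.d. N(0,1) entries,
% v \in \mathbb{R}^m and w \in \mathbb{R}^n are the trainable weights,
% (v^*, w^*) are the (assumed) ground-truth weights that generate the labels.
\begin{align*}
  \sigma(x) &= \mathbb{1}_{\{x > 0\}}
    && \text{(binary activation, applied entrywise)} \\
  \ell(v, w; Z) &= \tfrac{1}{2}\bigl(v^\top \sigma(Zw) - (v^*)^\top \sigma(Zw^*)\bigr)^2
    && \text{(sample loss)} \\
  f(v, w) &= \mathbb{E}_Z\bigl[\ell(v, w; Z)\bigr]
    && \text{(population loss)} \\
  g_\mu(v, w; Z) &= Z^\top\bigl(\mu'(Zw) \odot v\bigr)
    \bigl(v^\top \sigma(Zw) - (v^*)^\top \sigma(Zw^*)\bigr)
    && \text{(coarse gradient w.r.t. } w\text{)}
\end{align*}
```

The coarse gradient has the same form the chain rule would give for the derivative with respect to w, except that the almost-everywhere-zero derivative of \sigma is replaced by the surrogate \mu'. The derivative with respect to v needs no surrogate, since v enters the loss linearly.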

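Proposed Method

Model a two-layer CNN with binary activation and Gaussian input data.
Define the population loss f(v,w) as the expected squared loss over the Gaussian input Z.
Replace the zero derivative of the binary activation in backpropagation with a surrogate STE derivative mu' to form the coarse gradient g_mu.
Prove that, for the vanilla ReLU and clipped ReLU STEs, the expected coarse gradient correlates positively with the population gradient, so its negation is a descent direction.
Show that the identity STE does not provide a descent direction and can become unstable near certain local minima.
Provide a convergence result showing that coarse gradient descent converges to a critical point of the population loss.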
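
The steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: the names `coarse_gradient` and `STE_DERIVATIVES`, the teacher parameters `v_star` and `w_star`, the factor 1/2 in the squared loss, and the clipping threshold of 1 for the clipped ReLU are all assumptions introduced here.

```python
def binary_act(x):
    """Forward-pass activation: binarized ReLU."""
    return 1.0 if x > 0 else 0.0


# Surrogate derivatives mu' used only in the backward pass (assumed forms).
STE_DERIVATIVES = {
    "identity": lambda x: 1.0,                            # mu(x) = x
    "relu": lambda x: 1.0 if x > 0 else 0.0,              # mu(x) = max(x, 0)
    "clipped_relu": lambda x: 1.0 if 0 < x < 1 else 0.0,  # mu(x) = min(max(x, 0), 1)
}


def matvec(Z, w):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(z * wj for z, wj in zip(row, w)) for row in Z]


def coarse_gradient(v, w, v_star, w_star, Z, ste):
    """Return (grad_v, grad_w) for one sample Z, assuming loss = residual**2 / 2."""
    mu_prime = STE_DERIVATIVES[ste]
    pre = matvec(Z, w)
    act = [binary_act(p) for p in pre]
    target = sum(vs * binary_act(p) for vs, p in zip(v_star, matvec(Z, w_star)))
    residual = sum(vi * a for vi, a in zip(v, act)) - target
    # The derivative w.r.t. v is exact: v enters the loss linearly.
    grad_v = [a * residual for a in act]
    # The derivative w.r.t. w uses mu' in place of the zero derivative of binary_act.
    scaled = [mu_prime(p) * vi for p, vi in zip(pre, v)]
    grad_w = [residual * sum(Z[i][j] * scaled[i] for i in range(len(Z)))
              for j in range(len(w))]
    return grad_v, grad_w


# Small hand-checkable example (all values are illustrative).
Z = [[0.5, 0.0], [-1.0, 2.0]]
v, w = [1.0, 2.0], [1.0, 1.0]
v_star, w_star = [1.0, 1.0], [1.0, -1.0]
grad_v, grad_w_identity = coarse_gradient(v, w, v_star, w_star, Z, "identity")
_, grad_w_clipped = coarse_gradient(v, w, v_star, w_star, Z, "clipped_relu")
```

In this example the second pre-activation equals 1.0, which lies outside the open interval (0, 1), so the clipped ReLU STE drops that row while the identity STE keeps it. The three STEs differ only in which rows of Z they let through.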

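Experimental Results

Research Questions

RQ1: Does a proper choice of STE yield a descent direction for the population loss in activation-quantized networks?
RQ2: How strongly does the expected coarse gradient under different STEs correlate with the true population gradient?
RQ3: Does coarse gradient descent with an STE converge to a critical point, and under what conditions?
RQ4: How do different STEs perform empirically on standard benchmarks (MNIST, CIFAR-10) with 2-bit and 4-bit activations?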

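Key Findings

A proper choice of STE (vanilla ReLU or clipped ReLU) yields an expected coarse gradient whose negation is a descent direction for the population loss.
The identity STE does not guarantee a descent direction and can cause instability near certain local minima.
Coarse gradient descent with the ReLU or clipped ReLU STE converges to a critical point of the population loss under a suitable learning rate.
A poor STE causes instability and can repel the iterates from good minima, consistent with the CIFAR-10 experiments.
Empirical results suggest that the clipped ReLU STE generally performs best on deeper networks, that the vanilla ReLU STE performs comparably on shallow networks such as LeNet-5, and that the identity STE performs worst.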
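
The descent-direction finding can be checked numerically on a toy instance. The sketch below averages ReLU-STE coarse gradients over Gaussian samples and takes their inner product with a closed-form population gradient. That closed form is an assumption of this sketch: it follows from the standard Gaussian identity P(a·z > 0, b·z > 0) = (π − θ)/(2π) applied to a squared loss with factor 1/2, and holds for 0 < θ < π. All function names, dimensions, and parameter values are hypothetical.

```python
import math
import random


def binary_act(x):
    return 1.0 if x > 0 else 0.0


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def population_grad_w(v, w, v_star, w_star):
    """Assumed closed form: df/dw = -(v . v*) / (2 pi |w|) * u,
    where u is the unit vector along the part of w* orthogonal to w."""
    coef = dot(w, w_star) / dot(w, w)
    ortho = [ws - coef * wi for ws, wi in zip(w_star, w)]
    scale = -dot(v, v_star) / (2 * math.pi * math.sqrt(dot(w, w)) * math.sqrt(dot(ortho, ortho)))
    return [scale * o for o in ortho]


def mean_coarse_grad_w(v, w, v_star, w_star, ste_derivative, num_samples, rng):
    """Monte Carlo average of the coarse gradient w.r.t. w over Gaussian inputs Z."""
    m, n = len(v), len(w)
    total = [0.0] * n
    for _ in range(num_samples):
        Z = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
        pre = [dot(row, w) for row in Z]
        pre_star = [dot(row, w_star) for row in Z]
        residual = (sum(vi * binary_act(p) for vi, p in zip(v, pre))
                    - sum(vs * binary_act(p) for vs, p in zip(v_star, pre_star)))
        for i in range(m):
            weight = residual * ste_derivative(pre[i]) * v[i]
            for j in range(n):
                total[j] += weight * Z[i][j]
    return [t / num_samples for t in total]


rng = random.Random(0)                  # fixed seed for reproducibility
v_star, w_star = [1.0] * 4, [1.0, 0.0, 0.0]
v, w = [1.0] * 4, [0.0, 1.0, 0.0]       # w is orthogonal to w_star (theta = pi/2)
relu_ste = lambda x: 1.0 if x > 0 else 0.0

true_grad = population_grad_w(v, w, v_star, w_star)
mean_coarse = mean_coarse_grad_w(v, w, v_star, w_star, relu_ste, 5000, rng)
alignment = dot(mean_coarse, true_grad)
print("alignment with population gradient:", alignment)
```

A positive `alignment` means the negated average coarse gradient reduces the population loss to first order at this point. Passing a different surrogate derivative as `ste_derivative` lets the same harness probe other STEs, though a single point says nothing about the instability near local minima discussed above.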

This review was generated by AI and checked by a human editor.