QUICK REVIEW

[Paper Review] Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets

Penghang Yin, Jiancheng Lyu|arXiv (Cornell University)|Mar 13, 2019

Domain Adaptation and Few-Shot Learning | References: 42 | Citations: 62

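One-Line Summary

This paper theoretically justifies straight-through estimators (STEs) for activation-quantized neural networks by analyzing a two-layer model with binary activation and Gaussian input. It shows that a properly chosen STE yields a descent direction, whereas a poor choice such as the identity STE can cause instability. It also empirically compares the vanilla ReLU STE and the clipped ReLU STE on MNIST/CIFAR-10.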

ABSTRACT

Training activation quantized neural networks involves minimizing a piecewise constant function whose gradient vanishes almost everywhere, which is undesirable for the standard back-propagation or chain rule. An empirical way around this issue is to use a straight-through estimator (STE) (Bengio et al., 2013) in the backward pass only, so that the "gradient" through the modified chain rule becomes non-trivial. Since this unusual "gradient" is certainly not the gradient of loss function, the following question arises: why searching in its negative direction minimizes the training loss? In this paper, we provide the theoretical justification of the concept of STE by answering this question. We consider the problem of learning a two-linear-layer network with binarized ReLU activation and Gaussian input data. We shall refer to the unusual "gradient" given by the STE-modified chain rule as coarse gradient. The choice of STE is not unique. We prove that if the STE is properly chosen, the expected coarse gradient correlates positively with the population gradient (not available for the training), and its negation is a descent direction for minimizing the population loss. We further show the associated coarse gradient descent algorithm converges to a critical point of the population loss minimization problem. Moreover, we show that a poor choice of STE leads to instability of the training algorithm near certain local minima, which is verified with CIFAR-10 experiments.

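Motivation and Objectives

Motivate and formalize the use of STEs for training activation-quantized neural networks, whose activation gradients vanish almost everywhere
Analyze theoretically how the choice of STE affects the descent direction and convergence for a two-layer CNN with binary activation and Gaussian inputs
Characterize, near specific minima, whether the STE-induced coarse gradient yields monotone descent or destroys stability
Compare three STEs (identity, vanilla ReLU, clipped ReLU) both theoretically and empirically on standard benchmarks
Demonstrate the practical impact of the STE choice in deeper networks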
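
The vanishing-gradient problem named in the first objective can be seen in a few lines. The sketch below is an illustration only: `quantized_relu` is a generic uniform quantizer chosen for this example (its exact form, the `bits` and `alpha` parameters, and the helper names are assumptions, not the paper's definitions). It checks numerically that the derivative of the quantized activation is zero away from the jump points, and shows the kind of surrogate derivative an STE substitutes in the backward pass.

```python
import math

def quantized_relu(x, bits=2, alpha=1.0):
    """Piecewise-constant activation with levels 0, alpha, ..., (2**bits - 1) * alpha.

    Assumed form for illustration; the paper's quantizer may differ in detail.
    """
    if x <= 0:
        return 0.0
    top_level = 2 ** bits - 1
    return alpha * min(math.floor(x / alpha), top_level)

def finite_difference(f, x, h=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def clipped_relu_derivative(x, upper):
    """Surrogate derivative used only in the backward pass (clipped ReLU STE)."""
    return 1.0 if 0.0 < x < upper else 0.0

if __name__ == "__main__":
    for x in [-1.0, 0.37, 1.5, 2.25, 5.0]:
        true_slope = finite_difference(quantized_relu, x)
        surrogate = clipped_relu_derivative(x, upper=3.0)
        print(f"x={x:5.2f}  forward={quantized_relu(x):.1f}  "
              f"true slope={true_slope:.1f}  STE slope={surrogate:.1f}")
```

Away from the jumps the true slope is always zero, so standard back-propagation passes no signal through this layer. The surrogate slope is non-zero on part of the input range, which is what makes the resulting "coarse gradient" informative.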

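Proposed Method

Model learning as population loss minimization for a two-layer CNN with binary activation and Gaussian inputs
Define the STE-induced coarse gradient by replacing the zero derivative of the activation with a surrogate (mu') in the backward pass
Derive analytic expressions for the population loss f(v,w) and its gradient, and study its stationary points and potential minima
Prove that for proper STEs (vanilla ReLU and clipped ReLU) the negative expected coarse gradient is a descent direction for f and that the iterates converge to a critical point
Show that the identity STE does not in general produce a descent direction and can cause instability near certain local minima
Empirically compare the three STEs on MNIST and CIFAR-10 with 2-bit and 4-bit activations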
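
For reference, the quantities in the list above can be written out as follows. This is a sketch under stated assumptions: the symbols $Z$, $m$, $n$, $\sigma$, $\ell$, $g_\mu$ and the teacher parameters $(v^*, w^*)$ that generate the labels are notation introduced here for illustration and may not match the paper exactly; only $f(v,w)$ and $\mu'$ appear in the summary itself.

```latex
\begin{aligned}
&Z \in \mathbb{R}^{m \times n},\quad Z_{ij} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1),
  \qquad \sigma(x) = \mathbf{1}_{\{x > 0\}} \\[4pt]
&\ell(v, w; Z) = \tfrac{1}{2}\Big( v^\top \sigma(Zw) - (v^*)^\top \sigma(Zw^*) \Big)^2,
  \qquad f(v, w) = \mathbb{E}_Z\big[\ell(v, w; Z)\big] \\[4pt]
&\frac{\partial \ell}{\partial v}(v, w; Z)
  = \sigma(Zw)\,\Big( v^\top \sigma(Zw) - (v^*)^\top \sigma(Zw^*) \Big) \\[4pt]
&g_\mu(v, w; Z)
  = Z^\top \big( \mu'(Zw) \odot v \big)\,\Big( v^\top \sigma(Zw) - (v^*)^\top \sigma(Zw^*) \Big) \\[4pt]
&\mu'(x) = 1 \ \text{(identity)}, \qquad
  \mu'(x) = \mathbf{1}_{\{x > 0\}} \ \text{(vanilla ReLU)}, \qquad
  \mu'(x) = \mathbf{1}_{\{0 < x < 1\}} \ \text{(clipped ReLU)}
\end{aligned}
```

The ordinary chain rule would put $\sigma'(Zw)$ where $\mu'(Zw)$ appears in $g_\mu$, and $\sigma'$ is zero almost everywhere. The coarse gradient is what results from swapping in $\mu'$ in the backward pass while keeping $\sigma$ in the forward pass.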

Figure 1: The plots of the empirical loss moving by one step in the direction of negative coarse gradient v.s. the learning rate (step size) $\eta$ for different sample sizes.

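Experimental Results

Research Questions

RQ1: Can a proper choice of STE guarantee a descent direction for the population loss in activation-quantized networks?
RQ2: How do the vanilla ReLU and clipped ReLU STEs differ from the identity STE in terms of convergence and stability?
RQ3: How does the choice of STE affect training dynamics and final accuracy in shallow versus deep networks?
RQ4: Is the coarse gradient induced by each STE positively correlated with the true gradient of the population loss?
RQ5: How does the choice of STE affect practical performance on standard datasets such as MNIST and CIFAR-10?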

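Key Findings

Properly chosen STEs (vanilla ReLU and clipped ReLU) give an expected coarse gradient that correlates positively with the population gradient, and therefore provide a descent direction
The negative expected coarse gradient from the vanilla ReLU and clipped ReLU STEs decreases the population loss monotonically and converges to a critical point
The identity STE is a poor choice in this setting and can cause instability near local minima where the coarse gradient does not vanish
In the experiments, the clipped ReLU STE performed best on deep networks (VGG-11, ResNet-20) and was also competitive with LeNet-5 on MNIST
On CIFAR-10, the identity and ReLU STEs can become unstable, which supports the theoretical instability analysis
Overall, for the architectures studied, the clipped ReLU STE tends to be the most robust under 2-bit and 4-bit quantization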
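
The descent behaviour described in the findings above can be probed with a small simulation. The following is a minimal sketch, not the paper's experiment: it uses an empirical average over a fixed Gaussian sample instead of the population expectation, a tiny two-layer model with binary activation, and randomly drawn teacher parameters `v_star` and `w_star`. All function names, sizes, the learning rate and the seed are assumptions made for illustration, and no particular outcome is claimed for any STE.

```python
import random

def binary_act(x):
    """Forward activation: 1 if x > 0, else 0 (its derivative is zero almost everywhere)."""
    return 1.0 if x > 0 else 0.0

# Surrogate derivatives mu'(x), used only in the backward pass.
STE_DERIVATIVES = {
    "identity": lambda x: 1.0,
    "vanilla_relu": lambda x: 1.0 if x > 0 else 0.0,
    "clipped_relu": lambda x: 1.0 if 0.0 < x < 1.0 else 0.0,
}

def matvec(Z, w):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(z * wj for z, wj in zip(row, w)) for row in Z]

def sample_loss_and_grads(Z, v, w, v_star, w_star, ste):
    """Loss, gradient w.r.t. v, and coarse gradient w.r.t. w for one sample Z."""
    mu_prime = STE_DERIVATIVES[ste]
    pre = matvec(Z, w)
    act = [binary_act(p) for p in pre]
    target = sum(vs * binary_act(p) for vs, p in zip(v_star, matvec(Z, w_star)))
    residual = sum(vi * a for vi, a in zip(v, act)) - target
    grad_v = [a * residual for a in act]
    # STE step: the zero derivative of binary_act is replaced by mu'(pre).
    back = [mu_prime(p) * vi * residual for p, vi in zip(pre, v)]
    coarse_w = [sum(row[j] * b for row, b in zip(Z, back)) for j in range(len(w))]
    return 0.5 * residual ** 2, grad_v, coarse_w

def run_coarse_gradient_descent(ste, steps=50, lr=0.05, n_samples=200, m=4, n=3, seed=0):
    """Run sample-averaged coarse gradient descent; return the loss at each iterate."""
    rng = random.Random(seed)
    gauss_vec = lambda k: [rng.gauss(0.0, 1.0) for _ in range(k)]
    v_star, w_star = gauss_vec(m), gauss_vec(n)  # hypothetical teacher parameters
    v, w = gauss_vec(m), gauss_vec(n)            # random initialization
    data = [[gauss_vec(n) for _ in range(m)] for _ in range(n_samples)]
    losses = []
    for _ in range(steps + 1):
        total, gv, gw = 0.0, [0.0] * m, [0.0] * n
        for Z in data:
            loss, dv, dw = sample_loss_and_grads(Z, v, w, v_star, w_star, ste)
            total += loss
            gv = [a + b for a, b in zip(gv, dv)]
            gw = [a + b for a, b in zip(gw, dw)]
        losses.append(total / n_samples)
        v = [vi - lr * g / n_samples for vi, g in zip(v, gv)]
        w = [wi - lr * g / n_samples for wi, g in zip(w, gw)]
    return losses

if __name__ == "__main__":
    for name in STE_DERIVATIVES:
        losses = run_coarse_gradient_descent(name)
        print(name, "first loss:", losses[0], "last loss:", losses[-1])
```

Because the same seed produces the same data, teacher and initialization for every STE, the three loss curves differ only through the choice of `mu_prime`. A toy run like this can show qualitative differences but is no substitute for the population-level analysis or the CIFAR-10 experiments summarized above.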

Figure 2: When initialized with weights (good minima) produced by the vanilla (orange) and clipped (blue) ReLUs on ResNet-20 with 4-bit activations, the coarse gradient descent using the identity STE ends up being repelled from there. The learning rate is set to $10^{-5}$ until epoch 20.

This review was generated by AI and checked by a human editor.