QUICK REVIEW

[Paper Review] Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets

Penghang Yin, Jiancheng Lyu|arXiv (Cornell University)|Mar 13, 2019

Domain Adaptation and Few-Shot Learning | Cited by 120

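One-Sentence Summary

The paper proves that a properly chosen straight-through estimator (STE) yields a descent direction for the population loss of a two-layer network with binary activation, whereas the identity STE can cause instability; experiments show that the clipped ReLU STE tends to perform best in quantized networks.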

ABSTRACT

Training activation quantized neural networks involves minimizing a piecewise constant function whose gradient vanishes almost everywhere, which is undesirable for the standard back-propagation or chain rule. An empirical way around this issue is to use a straight-through estimator (STE) (Bengio et al., 2013) in the backward pass only, so that the "gradient" through the modified chain rule becomes non-trivial. Since this unusual "gradient" is certainly not the gradient of loss function, the following question arises: why searching in its negative direction minimizes the training loss? In this paper, we provide the theoretical justification of the concept of STE by answering this question. We consider the problem of learning a two-linear-layer network with binarized ReLU activation and Gaussian input data. We shall refer to the unusual "gradient" given by the STE-modified chain rule as coarse gradient. The choice of STE is not unique. We prove that if the STE is properly chosen, the expected coarse gradient correlates positively with the population gradient (not available for the training), and its negation is a descent direction for minimizing the population loss. We further show the associated coarse gradient descent algorithm converges to a critical point of the population loss minimization problem. Moreover, we show that a poor choice of STE leads to instability of the training algorithm near certain local minima, which is verified with CIFAR-10 experiments.

Motivation and Objectives

Motivate the study of training activation-quantized networks and the gradient issues arising from piecewise-constant activations.
Define a tractable two-linear-layer CNN model with binary activation and Gaussian inputs to enable theoretical analysis.
Introduce coarse gradient via STE in the backward pass and analyze its relationship to the true population gradient.
Prove that proper STE choices yield descent directions and convergence to critical points, while a poor STE (identity) can cause instability.
Empirically compare STEs on MNIST and CIFAR-10 to validate theoretical findings and inform practical STE choice.

Proposed Method

Model a two-linear-layer CNN with binary activation and Gaussian input data.
Define the population loss f(v,w) as the expectation of the squared sample loss over the Gaussian input Z.
Replace the almost-everywhere-zero derivative of the binary activation in backprop with a surrogate STE derivative μ' to form a coarse gradient g_μ.
Prove that for the vanilla ReLU and clipped ReLU STEs, the expected coarse gradient correlates positively with the population gradient, so its negation is a descent direction for the population loss.
Show that identity STE does not provide a descent direction and can be unstable near certain minima.
Provide convergence results showing coarse gradient descent converges to a critical point of the population loss.
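
The steps above can be illustrated with a minimal NumPy sketch. It assumes the sample loss has the form 0.5 * (v · σ(Zw) - y)^2 with σ(x) = 1 for x > 0 and 0 otherwise, and that the three STEs correspond to μ(x) = x (identity), μ(x) = max(x, 0) (vanilla ReLU), and μ(x) = min(max(x, 0), 1) (clipped ReLU). The names `binary_act`, `STE_DERIVATIVES`, and `coarse_gradient` are hypothetical and are not taken from the paper.

```python
import numpy as np

def binary_act(x):
    """Binary activation used in the forward pass: 1 where x > 0, else 0."""
    return (x > 0).astype(float)

# Surrogate derivatives mu'(x); they are used only in the backward pass.
# Assumed forms: identity, vanilla ReLU, and clipped ReLU on [0, 1].
STE_DERIVATIVES = {
    "identity": lambda x: np.ones_like(x, dtype=float),
    "relu": lambda x: (x > 0).astype(float),
    "clipped_relu": lambda x: ((x > 0) & (x < 1)).astype(float),
}

def coarse_gradient(v, w, Z, y, ste="clipped_relu"):
    """Return the sample loss, the exact gradient in v, and the coarse gradient in w.

    v: (m,) second-layer weights, w: (n,) first-layer weights,
    Z: (m, n) input matrix, y: scalar target.
    """
    pre = Z @ w                          # pre-activations, shape (m,)
    act = binary_act(pre)                # forward pass keeps the binary activation
    residual = v @ act - y               # scalar prediction error
    loss = 0.5 * residual ** 2
    grad_v = act * residual              # exact gradient with respect to v
    # Chain rule in w with sigma'(pre) swapped for the surrogate mu'(pre).
    grad_w = (Z.T @ (STE_DERIVATIVES[ste](pre) * v)) * residual
    return loss, grad_v, grad_w
```

Only `grad_w` depends on the STE choice, because the loss is already differentiable in v. Swapping the `ste` argument changes which rows of Z contribute to the update while leaving the forward pass untouched.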

Experimental Results

Research Questions

RQ1: Does selecting a proper STE yield a descent direction for the population loss in activation-quantized networks?
RQ2: How does the expected coarse gradient under different STEs correlate with the true population gradient?
RQ3: Can coarse gradient descent with STEs converge to a critical point, and under what conditions?
RQ4: How do different STE choices perform empirically on standard benchmarks (MNIST, CIFAR-10) with 2-bit and 4-bit activations?

Key Findings

Proper STE choices (vanilla ReLU and clipped ReLU) yield expected coarse gradients whose negations are descent directions for the population loss.
Identity STE does not guarantee descent and can lead to instability near certain local minima.
Coarse gradient descent using ReLU or clipped ReLU converges to a critical point of the population loss under suitable learning rates.
Poor STEs can cause instability and repulsion from good minima, consistent with CIFAR-10 experiments.
Empirical results show clipped ReLU STE generally performs best on deeper networks, with vanilla ReLU close on shallower networks like LeNet-5, and identity STE performing worst.
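
To experiment with these findings on a toy problem, the following is a minimal sketch of mini-batch coarse gradient descent in a teacher-student setup with Gaussian inputs. The function name `run_coarse_gd`, the teacher weights `v_star` and `w_star`, and all hyperparameters (step count, learning rate, batch size, dimensions) are assumptions chosen for illustration; the loop approximates expected coarse gradients by batch averages and is not the paper's experimental code.

```python
import numpy as np

def run_coarse_gd(ste_derivative, steps=200, lr=0.05, batch=64, m=4, n=8, seed=0):
    """Run mini-batch coarse gradient descent; return final weights and the loss history."""
    rng = np.random.default_rng(seed)
    v_star, w_star = rng.normal(size=m), rng.normal(size=n)   # hypothetical teacher
    v, w = rng.normal(size=m), rng.normal(size=n)             # student initialization
    losses = []
    for _ in range(steps):
        Z = rng.normal(size=(batch, m, n))                    # fresh Gaussian inputs
        pre = Z @ w                                           # shape (batch, m)
        act = (pre > 0).astype(float)                         # binary activation
        target = (Z @ w_star > 0).astype(float) @ v_star      # teacher outputs, (batch,)
        residual = act @ v - target                           # shape (batch,)
        losses.append(0.5 * np.mean(residual ** 2))
        grad_v = np.mean(act * residual[:, None], axis=0)     # exact gradient in v
        # Coarse gradient in w: sigma' is replaced by the surrogate derivative.
        weighted = ste_derivative(pre) * v * residual[:, None]
        grad_w = np.einsum("bmn,bm->n", Z, weighted) / batch
        v = v - lr * grad_v
        w = w - lr * grad_w
    return v, w, np.array(losses)

def clipped_relu_ste(x):
    """Derivative of min(max(x, 0), 1), used as the surrogate mu'."""
    return ((x > 0) & (x < 1)).astype(float)
```

Calling `run_coarse_gd(clipped_relu_ste)` and then `run_coarse_gd(lambda x: np.ones_like(x))` with the same seed gives two loss histories that can be plotted side by side to compare the clipped ReLU and identity STEs on this toy model.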

This review was generated by AI and reviewed by human editors.