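[Paper Review] Discovering Low-Precision Networks Close to Full-Precision Networks for Efficient Embedded Inference
This paper shows that 8-bit and 4-bit quantized networks (weights and activations) can match or even exceed full-precision ImageNet baselines when pretrained models are fine-tuned with activation-range calibration, longer training in the 4-bit case, and SGD settings chosen to counter quantization noise.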
To realize the promise of ubiquitous embedded deep network inference, it is essential to seek limits of energy and area efficiency. To this end, low-precision networks offer tremendous promise because both energy and area scale down quadratically with the reduction in precision. Here we demonstrate ResNet-18, -34, -50, -152, Inception-v3, Densenet-161, and VGG-16bn networks on the ImageNet classification benchmark that, at 8-bit precision exceed the accuracy of the full-precision baseline networks after one epoch of finetuning, thereby leveraging the availability of pretrained models. We also demonstrate ResNet-18, -34, -50, -152, Densenet-161, and VGG-16bn 4-bit models that match the accuracy of the full-precision baseline networks -- the highest scores to date. Surprisingly, the weights of the low-precision networks are very close (in cosine similarity) to the weights of the corresponding baseline networks, making training from scratch unnecessary. We find that gradient noise due to quantization during training increases with reduced precision, and seek ways to overcome this noise. The number of iterations required by SGD to achieve a given training error is related to the square of (a) the distance of the initial solution from the final plus (b) the maximum variance of the gradient estimates. Therefore, we (a) reduce solution distance by starting with pretrained fp32 precision baseline networks and fine-tuning, and (b) combat gradient noise introduced by quantization by training longer and reducing learning rates. Sensitivity analysis indicates that these simple techniques, coupled with proper activation function range calibration to take full advantage of the limited precision, are sufficient to discover low-precision networks, if they exist, close to fp32 precision baseline networks. The results herein provide evidence that 4-bits suffice for classification.
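Research Motivation and Goals
- Advance energy- and area-efficient embedded inference through low-precision networks.
- Show that 8-bit networks can exceed their FP32 baselines after minimal fine-tuning.
- Show that 4-bit networks can match their FP32 baselines across several architectures.
- Provide evidence that the pretrained high-precision weights remain close to their low-precision counterparts after quantization.
- Analyze the gradient noise caused by quantization and propose mitigation strategies based on fine-tuning and calibration.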
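Proposed Method
- Quantize pretrained FP32 networks to 8-bit and 4-bit fixed-point representations for both weights and activations.
- Calibrate each layer's activation range with a lightweight forward pass through the unquantized model.
- Fine-tune the quantized network starting from the pretrained weights (FAQ: Fine-tuning After Quantization).
- Use a fixed-point quantizer Q_{b,l}, calibrated per layer, to constrain weights and activations to 8 or 4 bits.
- During training, backpropagate through the quantizer using the straight-through estimator.
- For 4-bit networks, extend training to 110 epochs, with a learning-rate schedule and adjusted weight decay to reduce the effect of gradient noise.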
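The calibrate, quantize, and straight-through steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `calibrate_range`, `quantize`, and `ste_grad`, the percentile-based calibration rule, and the sample values are all hypothetical choices made for the example.

```python
import math


def calibrate_range(samples, percentile=99.9):
    """Choose a clipping threshold from values recorded on the unquantized model.

    The percentile rule is an assumption made for this sketch.
    """
    ordered = sorted(abs(s) for s in samples)
    # Nearest-rank percentile: smallest magnitude covering `percentile` % of samples.
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    return ordered[rank - 1]


def quantize(x, bits, max_val, signed=True):
    """Uniform fixed-point quantizer: clip to the calibrated range, round to a grid."""
    # Number of positive steps available at this bit width.
    levels = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    lo = -max_val if signed else 0.0
    clipped = min(max(x, lo), max_val)
    # Integer code first, then scale back to a real value.
    # Python's round() sends exact ties to the nearest even integer.
    code = round(clipped * levels / max_val)
    return code * max_val / levels


def ste_grad(x, upstream_grad, max_val, signed=True):
    """Backward pass with the straight-through estimator.

    Rounding is treated as the identity, so the gradient passes through
    unchanged inside the clipping range; clipping zeroes it outside.
    """
    lo = -max_val if signed else 0.0
    return upstream_grad if lo <= x <= max_val else 0.0


# Made-up non-negative activations with one outlier (assumed values).
recorded = [0.0, 0.2, 0.4, 0.7, 1.1, 1.6, 2.3, 9.0]
clip = calibrate_range(recorded, percentile=87.5)  # excludes the outlier here
for bits in (8, 4):
    print(bits, [quantize(a, bits, clip, signed=False) for a in recorded])
```

Clipping the outlier before rounding keeps the few available levels (16 at 4 bits) concentrated where most activations fall, which is why calibration matters more as precision drops.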
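Experimental Results
Research Questions
- RQ1: After a high-precision pretrained network is quantized to 8 or 4 bits, can fine-tuning match or exceed its full-precision accuracy on ImageNet?
- RQ2: How does quantization-induced gradient noise affect training, and which simple strategies (e.g., larger batches, longer training, learning-rate schedules) mitigate it?
- RQ3: Do 4-bit networks exist that match the full-precision baseline across multiple architectures?
- RQ4: Does the final low-precision solution lie near the original high-precision initialization, implying that training from scratch is unnecessary?
- RQ5: Does the FAQ method generalize to datasets other than ImageNet (e.g., CIFAR-10)?
Key Findings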
| Network | Method | Precision in bits (weights, activations) | Top-1 accuracy (%) | Top-5 accuracy (%) |
|---|---|---|---|---|
| ResNet-18 | Baseline | 32,32 | 69.76 | 89.08 |
| ResNet-18 | FAQ (This paper) | 8,8 | 70.02 | 89.32 |
| ResNet-18 | FAQ (This paper) | 4,4 | 69.78 ± 0.04 | 89.11 ± 0.03 |
| ResNet-34 | Baseline | 32,32 | 73.30 | 91.42 |
| ResNet-34 | FAQ (This paper) | 8,8 | 73.71 | 91.63 |
| ResNet-34 | FAQ (This paper) | 4,4 | 73.31 | 91.32 |
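- Across multiple architectures, 8-bit networks exceed their full-precision baselines after a single epoch of fine-tuning.
- On ResNet-18, ResNet-34, ResNet-50, ResNet-152, DenseNet-161, and VGG-16bn, 4-bit networks reach accuracy on par with the full-precision baselines.
- Quantization-induced gradient noise grows as precision decreases and affects fine-tuning, especially at 4 bits.
- Starting from a pretrained FP32 network and fine-tuning (FAQ) helps locate a near-optimal low-precision solution close to the high-precision initialization.
- Longer fine-tuning (110 epochs) and larger batches improve 4-bit performance; calibrated activation ranges are essential (for example, the first and last layers are kept at 8 bits).
- Cosine-similarity analysis shows that the 4-bit weights after FAQ are highly similar to the initial FP32 weights, indicating that the solution lies near the high-precision one.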
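The cosine-similarity measure used in the last finding can be illustrated as follows. This is a toy sketch under assumptions: the weight values and the step size are made up, and the second vector is produced by rounding alone, whereas the paper compares fine-tuned 4-bit weights against the initial FP32 weights.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical flattened FP32 weights of one layer (assumed values).
fp32_weights = [0.42, -0.13, 0.07, -0.88, 0.31]

# Stand-in for low-precision weights: snap each value to a grid of step 0.125.
step = 0.125
low_precision = [round(w / step) * step for w in fp32_weights]

print(cosine_similarity(fp32_weights, low_precision))
```

A value close to 1 means the two weight vectors point in nearly the same direction, which is the sense in which the low-precision solution is said to lie near the FP32 one.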
This review was generated by AI and checked by a human editor.