[Paper Explained] Exploration of Low Numeric Precision Deep Learning Inference Using Intel FPGAs
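This paper presents a custom FPGA hardware accelerator for low-precision deep learning inference at 1-bit and 2-bit data widths, targeting high throughput and energy efficiency. On Arria 10, an AlexNet with 2-bit activations and ternary weights runs at 3,700 images per second on ImageNet with a top-1 accuracy of 49%. For Stratix 10, a hardware modeler projects the performance of a modified ResNet-34 that loses only 3.7% accuracy relative to single precision.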
CNNs have been shown to maintain reasonable classification accuracy when quantized to lower precisions. Quantizing to sub 8-bit activations and weights can result in accuracy falling below an acceptable threshold. Techniques exist for closing the accuracy gap of limited numeric precision typically by increasing computation. This results in a trade-off between throughput and accuracy and can be tailored for different networks through various combinations of activation and weight data widths. Hardware architectures like FPGAs provide the opportunity for data width specific computation through unique logic configurations leading to highly optimized processing that is unattainable by full precision networks. Ternary and binary weighted networks offer an efficient method of inference for 2-bit and 1-bit data respectively. Most hardware architectures can take advantage of the memory storage and bandwidth savings that come along with smaller datapaths, but very few architectures can take advantage of limited numeric precision at the computation level. In this paper, we present a hardware design for FPGAs that takes advantage of bandwidth, memory, power, and computation savings of limited numerical precision data. We provide insights into the trade-offs between throughput and accuracy for various networks and how they map to our framework. Further, we show how limited numeric precision computation can be efficiently mapped onto FPGAs for both ternary and binary cases. Starting with Arria 10, we show a 2-bit activation and ternary weighted AlexNet running in hardware that achieves 3,700 images per second on the ImageNet dataset with a top-1 accuracy of 0.49. Using a hardware modeler designed for our low numeric precision framework we project performance most notably for a 55.5 TOPS Stratix 10 device running a modified ResNet-34 with only 3.7% accuracy degradation compared with single precision.
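Motivation and Goals
- Explore the feasibility and performance trade-offs of low numeric precision (1-bit to 2-bit) deep neural network inference on FPGAs.
- Design a hardware-optimized framework that uses FPGA reconfigurability to compute efficiently at sub-8-bit precision.
- Minimize the accuracy loss of quantized networks through custom hardware mappings for ternary and binary weights and activations.
- Evaluate the trade-offs among throughput, power efficiency, and accuracy across several network architectures, such as AlexNet and ResNet-34.
- Project how closely a high-end FPGA such as Stratix 10 can approach single-precision inference accuracy with minimal accuracy loss.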
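Proposed Method
- Design a reconfigurable FPGA architecture dedicated to 1-bit and 2-bit data widths, targeting Intel Arria 10 and Stratix 10 FPGAs.
- Use dedicated logic to perform low-precision multiply-accumulate operations for ternary weights ({-1, 0, +1}), binary weights (±1), and 2-bit activations.
- Narrow the datapath to reduce memory bandwidth and storage, exploiting the FPGA's ability to tailor the data width to each operation.
- Use a hardware modeling framework to simulate and project performance across network configurations and FPGA devices.
- Apply quantization-aware training principles to preserve accuracy in low-precision models, particularly for ResNet-34.
- Map network layers onto FPGA resources with custom pipelining and parallelization strategies to maximize throughput.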
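To make the low-precision multiply-accumulate idea concrete, here is a minimal software sketch, not the paper's hardware design. It assumes ternary weights in {-1, 0, +1}, so each "multiplication" collapses to an add, a subtract, or a skip. It also shows the widely used XNOR-popcount trick for the fully binary case, where both operands are ±1 and packed into integer bits. The function names `ternary_dot`, `pack_signs`, and `binary_dot_xnor` are hypothetical and chosen for illustration only.

```python
def ternary_dot(activations, weights):
    """Dot product with weights restricted to {-1, 0, +1}.

    No multiplier is needed: each weight selects add, subtract, or skip.
    """
    acc = 0
    for a, w in zip(activations, weights):
        if w == 1:
            acc += a
        elif w == -1:
            acc -= a
        elif w != 0:
            raise ValueError("weights must be -1, 0, or +1")
    return acc


def pack_signs(values):
    """Pack a list of +1/-1 values into an int (bit i is 1 when values[i] == +1)."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
        elif v != -1:
            raise ValueError("values must be +1 or -1")
    return bits


def binary_dot_xnor(a_bits, w_bits, n):
    """Dot product of two packed ±1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ w_bits) & mask   # XNOR: 1 where the signs match
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - n              # matches minus mismatches


# Example usage with small hand-checked inputs.
print(ternary_dot([3, 1, 2, 0], [1, -1, 0, 1]))
a = pack_signs([1, -1, 1, 1])
w = pack_signs([1, 1, -1, 1])
print(binary_dot_xnor(a, w, 4))
```

On an FPGA, the same add/subtract/skip selection and the XNOR-popcount reduction can be built from lookup tables rather than full multipliers, which is the kind of computation-level saving the summary refers to.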
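Experimental Results
Research Questions
- RQ1: How can FPGA-based hardware be optimized to execute 1-bit and 2-bit inference operations efficiently?
- RQ2: What are the trade-offs among throughput, energy efficiency, and accuracy for low-precision deep learning inference on FPGAs?
- RQ3: How well is accuracy preserved when ternary and binary weighted networks are mapped onto FPGA hardware?
- RQ4: How well does low-precision inference performance scale across FPGA devices and network architectures?
- RQ5: Can FPGA-optimized hardware reach near-single-precision accuracy at sub-8-bit precision?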
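Key Findings
- On Arria 10, an AlexNet with 2-bit activations and ternary weights achieves 3,700 images per second on ImageNet with a top-1 accuracy of 49%.
- The narrower datapath saves substantial memory and bandwidth in the FPGA design, which enables higher throughput.
- A modified ResNet-34, whose performance is projected for Stratix 10 rather than measured in hardware, shows only 3.7% accuracy degradation compared with single precision.
- The hardware modeler projects performance for a 55.5 TOPS Stratix 10 device running low-precision inference.
- The framework shows that FPGAs can exploit computation-level optimizations at sub-8-bit precision, which most hardware architectures cannot do.
- The results indicate that, with custom hardware mapping and quantization techniques, low-precision inference on FPGAs is achievable with small accuracy loss.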
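The memory and bandwidth savings mentioned above follow from simple arithmetic on the weight bit width. The sketch below is illustrative only: the one-million-weight layer is a made-up example, not a figure from the paper, and it assumes a simple encoding of 2 bits per ternary weight and 1 bit per binary weight. The helper names are hypothetical.

```python
def weight_storage_bytes(num_weights, bits_per_weight):
    """Bytes needed to store num_weights values at bits_per_weight each (rounded up)."""
    return (num_weights * bits_per_weight + 7) // 8


def compression_ratio(baseline_bits, reduced_bits):
    """How many times smaller the reduced-width storage is than the baseline."""
    return baseline_bits / reduced_bits


# Hypothetical layer with one million weights (assumption, not from the paper).
n = 1_000_000
print(weight_storage_bytes(n, 32))  # single precision
print(weight_storage_bytes(n, 2))   # ternary, stored in 2 bits each
print(weight_storage_bytes(n, 1))   # binary
print(compression_ratio(32, 2))
```

Under these assumptions, ternary weights take one sixteenth of the single-precision storage, and the same factor applies to the bandwidth needed to stream them.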
This explainer was generated by AI and reviewed by human editors.