QUICK REVIEW

[Paper Review] A Survey of FPGA-Based Neural Network Accelerator

Kaiyuan Guo, Shulin Zeng|arXiv (Cornell University)|Dec 24, 2017

Advanced Neural Network Applications|References 73|Cited by 140

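One-Sentence Summary

This survey reviews FPGA-based neural network inference accelerators, detailing the software and hardware techniques, model compression methods, and architectural strategies that improve speed and energy efficiency, and compares them with GPUs.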

ABSTRACT

Recent researches on neural network have shown significant advantage in machine learning over traditional algorithms based on handcrafted features and models. Neural network is now widely adopted in regions like image, speech and video recognition. But the high computation and storage complexity of neural network inference poses great difficulty on its application. CPU platforms are hard to offer enough computation capacity. GPU platforms are the first choice for neural network process because of its high computation capacity and easy to use development frameworks. On the other hand, FPGA-based neural network inference accelerator is becoming a research topic. With specifically designed hardware, FPGA is the next possible solution to surpass GPU in speed and energy efficiency. Various FPGA-based accelerator designs have been proposed with software and hardware optimization techniques to achieve high speed and energy efficiency. In this paper, we give an overview of previous work on neural network inference accelerators based on FPGA and summarize the main techniques used. An investigation from software to hardware, from circuit level to system level is carried out to complete analysis of FPGA-based neural network inference accelerator design and serves as a guide to future work.

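Motivation and Objectives

Assess the challenges and opportunities of FPGA-based NN inference relative to CPUs and GPUs.
Summarize the software and hardware optimization techniques that achieve high throughput and energy efficiency on FPGAs.
Analyze hardware-oriented model compression methods and their effect on accuracy and performance.
Evaluate architectural design strategies spanning computation units, loop unrolling, and system integration.
Provide guidance for the future development of FPGA-based neural network accelerators.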

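Proposed Approach

Present a simple performance model of FPGA-based NN accelerators for analyzing energy efficiency.
Review hardware-oriented compression techniques such as data quantization, weight reduction, and pruning.
Describe computation unit designs that use fixed-point arithmetic and heterogeneous bit-width strategies.
Discuss fast convolution methods (DFT/FFT, Winograd) and frequency optimization for convolution layers.
Explain how loop unrolling, grouped processing, and pipelining strategies raise throughput and utilization.
Compare state-of-the-art FPGA-based neural network accelerator designs to infer achievable performance.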
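
To make the idea of a simple accelerator performance model concrete, here is a minimal Python sketch. It is not the survey's own formulation: it assumes throughput is peak throughput scaled by a utilization factor and that energy efficiency is throughput divided by power. The class `AcceleratorConfig`, its fields, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AcceleratorConfig:
    """Hypothetical accelerator parameters (illustrative, not from the survey)."""
    num_mac_units: int   # parallel multiply-accumulate units
    clock_hz: float      # working frequency in Hz
    utilization: float   # fraction of peak throughput actually sustained (0..1)
    power_watts: float   # total board or chip power


def peak_ops_per_second(cfg: AcceleratorConfig) -> float:
    # Assumption: one MAC counts as two operations (multiply + add).
    return 2.0 * cfg.num_mac_units * cfg.clock_hz


def sustained_ops_per_second(cfg: AcceleratorConfig) -> float:
    # Peak throughput scaled by how well the units are kept busy.
    return peak_ops_per_second(cfg) * cfg.utilization


def inferences_per_second(cfg: AcceleratorConfig, ops_per_inference: float) -> float:
    # Workload per inference divides the sustained throughput.
    return sustained_ops_per_second(cfg) / ops_per_inference


def ops_per_joule(cfg: AcceleratorConfig) -> float:
    # Energy efficiency: operations delivered per joule consumed.
    return sustained_ops_per_second(cfg) / cfg.power_watts
```

Under these assumptions the three levers are visible directly: more parallel units or a higher clock raise the peak, better scheduling and memory access raise utilization, and a smaller model (fewer operations per inference) raises inferences per second without touching the hardware.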

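Experimental Results

Research Questions

RQ1: What are the core design challenges in achieving high throughput and high energy efficiency for FPGA-based neural network inference?
RQ2: How do hardware-oriented model compression techniques (quantization, pruning, low-rank approximation) affect accuracy and hardware performance on FPGAs?
RQ3: Which architectural strategies (computation units, loop unrolling, memory organization) most effectively improve the performance of FPGA neural network accelerators?
RQ4: What are the benefits and trade-offs of fast convolution and frequency optimization methods in FPGA implementations?
RQ5: How do FPGA-based accelerators compare with GPUs in energy efficiency for neural network inference?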
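
The trade-off behind the fast-convolution question can be illustrated with the standard 1-D Winograd F(2,3) algorithm, which produces two outputs of a 3-tap filter with four multiplications instead of six, at the cost of extra additions and a transformed filter. The Python sketch below is a generic textbook instance, not code from the survey; the function names are illustrative.

```python
def direct_conv_1d(d, g):
    """Direct 3-tap filtering of 4 inputs: 2 outputs, 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Winograd F(2,3): the same 2 outputs using 4 data multiplications."""
    # Filter-side terms depend only on g, so they can be precomputed
    # once per filter and reused for every input tile.
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2

    # Four multiplications between transformed data and transformed filter.
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * g[2]

    # Output transform: additions only.
    return [m1 + m2 + m3, m2 - m3 - m4]
```

The saving in multipliers is what makes the method attractive on DSP-limited hardware, while the added adders, the transformed filter values, and the fixed tile and kernel size are the constraints a design has to weigh.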

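Key Findings

By exploiting model quantization and sparse representations, FPGA-based neural network accelerators can achieve high energy efficiency.
Hardware-oriented quantization (linear and nonlinear) and weight reduction can significantly reduce computation and memory costs.
With appropriate training or fine-tuning, low bit-width computation units and heterogeneous bit-width designs can maintain accuracy while reducing resource usage.
Fast convolution methods (DFT/FFT and Winograd) offer theoretical speedups for convolution layers, subject to kernel size and hardware constraints.
High-speed FPGA designs use aggressive loop unrolling, grouped processing, and frequency optimization to raise throughput; designs with optimized memory access show better utilization.
A holistic view, from software-level model compression to hardware-level architecture design, is essential to maximizing the performance of FPGA NN accelerators.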
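
As a small illustration of linear quantization, the sketch below maps floating-point weights to signed integer codes with a single scale factor and then reconstructs them. It assumes symmetric, per-tensor quantization, which is only one simple form of the schemes such surveys cover; the function names and the 8-bit default are illustrative choices, not taken from the paper.

```python
def quantize_linear(values, num_bits=8):
    """Symmetric linear quantization to signed integers of num_bits bits."""
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero input.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_linear(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]
```

With this scheme the reconstruction error of each weight is bounded by about half the scale, so fewer bits mean a coarser grid and cheaper arithmetic, which is why fine-tuning is typically needed to recover accuracy at low bit widths.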

This review was generated by AI and reviewed by human editors.