QUICK REVIEW

[Paper Review] A 7.663-TOPS 8.2-W Energy-efficient FPGA Accelerator for Binary Convolutional Neural Networks

Yixing Li, Zichuan Liu|arXiv (Cornell University)|Feb 20, 2017

Advanced Neural Network Applications | References: 4 | Cited by: 26

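One-Sentence Summary

This paper presents a 7.663-TOPS, 8.2-W FPGA accelerator optimized for binary convolutional neural networks (BCNNs) that uses massive spatial parallelism and deep pipelining to achieve 8.3x higher throughput and 75x better energy efficiency than a Titan X GPU for small-batch inference. Its performance is insensitive to batch size, so it outperforms the GPU in dynamic, low-latency scenarios.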

ABSTRACT

FPGA-based hardware accelerators for convolutional neural networks (CNNs) have attracted great attention due to their higher energy efficiency than GPUs. However, it is challenging for FPGA-based solutions to achieve a higher throughput than GPU counterparts. In this paper, we demonstrate that FPGA acceleration can be a superior solution in terms of both throughput and energy efficiency when a CNN is trained with binary constraints on weights and activations. Specifically, we propose an optimized FPGA accelerator architecture tailored for bitwise convolution and normalization that features massive spatial parallelism with deep pipeline stages. A key advantage of the FPGA accelerator is that its performance is insensitive to data batch size, while the performance of GPU acceleration varies largely depending on the batch size of the data. Experimental results show that the proposed accelerator architecture for binary CNNs running on a Virtex-7 FPGA is 8.3x faster and 75x more energy-efficient than a Titan X GPU for processing online individual requests in small batch sizes. For processing static data in large batch sizes, the proposed solution is on a par with a Titan X GPU in terms of throughput while delivering 9.5x higher energy efficiency.

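Research Motivation and Objectives

Address the challenge of achieving CNN inference on FPGAs with both higher throughput and higher energy efficiency than GPUs.
Overcome the performance fluctuations caused by the strong dependence of GPU accelerator performance on batch size.
Design an FPGA architecture dedicated to binary convolution and normalization operations.
Deliver high performance in real-time, low-latency applications, which typically use small-batch inputs.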

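Proposed Method

Implement a custom accelerator on a Virtex-7 FPGA that supports bitwise convolution through massive spatial parallelism and deep pipeline stages.
Optimize the architecture for binary neural networks by exploiting bit-level parallelism and simplified arithmetic.
Integrate an efficient normalization unit to support batch normalization in the BCNN inference pipeline.
Structure the design so that performance stays consistent across batch sizes, minimizing throughput degradation.
Use high-level synthesis and custom memory access patterns to maximize data throughput and minimize latency.
Balance resource utilization against clock frequency to achieve high throughput at low power (8.2 W).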
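
The bitwise convolution and normalization listed above are commonly realized in binary networks as XNOR plus popcount, followed by a single threshold comparison that absorbs batch normalization and the sign activation. The Python sketch below illustrates that general technique only. It is not taken from the paper, and whether the accelerator folds normalization exactly this way is an assumption. The names `binarize`, `pack_bits`, `xnor_popcount_dot`, and `bn_sign_as_threshold` are hypothetical.

```python
import numpy as np

def binarize(x):
    """Map real values to {-1, +1}; zero is mapped to +1 (an assumed convention)."""
    return np.where(x >= 0, 1, -1).astype(np.int8)

def pack_bits(v):
    """Encode a {-1, +1} vector as a Python int: bit i is 1 where v[i] == +1."""
    word = 0
    for i, s in enumerate(v):
        if s == 1:
            word |= 1 << i
    return word

def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two length-n {-1, +1} vectors from their bit encodings."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 where the two signs match
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - n              # (#matches) - (#mismatches)

def bn_sign_as_threshold(x, mu, sigma, gamma, beta):
    """Evaluate gamma * (x - mu) / sigma + beta >= 0 with one comparison.

    Assumes sigma > 0 and gamma != 0. The comparison flips when gamma < 0.
    """
    tau = mu - beta * sigma / gamma
    return x >= tau if gamma > 0 else x <= tau

# Usage: the bitwise result matches an ordinary integer dot product.
rng = np.random.default_rng(0)
a = binarize(rng.standard_normal(64))
b = binarize(rng.standard_normal(64))
bitwise = xnor_popcount_dot(pack_bits(a), pack_bits(b), 64)
reference = int(np.dot(a.astype(int), b.astype(int)))
print(bitwise == reference)
```

In hardware, the appeal of this formulation is that multiply-accumulate units are replaced by XNOR gates and a popcount tree, and the normalization step reduces to comparing an integer against a precomputed constant.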

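Experimental Results

Research Questions

RQ1: Can an FPGA-based accelerator achieve higher throughput and energy efficiency than a GPU for binary CNN inference?
RQ2: How does the FPGA accelerator's performance vary with input batch size compared with a GPU?
RQ3: How much can spatial parallelism and deep pipelining improve throughput in a binary CNN accelerator?
RQ4: Can the FPGA accelerator sustain high performance in small-batch inference, the common case in online applications?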

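Key Findings

The proposed accelerator achieves an inference throughput of 7.663 TOPS on a Virtex-7 FPGA.
For small-batch inference (for example, online requests), the accelerator is 8.3x faster than a Titan X GPU.
In the small-batch setting, the FPGA solution is 75x more energy-efficient than a Titan X GPU.
For large-batch processing of static data, the accelerator's throughput is on a par with a Titan X GPU while delivering 9.5x better energy efficiency.
The accelerator's performance is insensitive to batch size, whereas the GPU's performance depends heavily on it.
The design shows that, for binary CNNs, FPGA acceleration can outperform GPU acceleration in both throughput and energy efficiency.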
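
The headline figures above can be combined with a little arithmetic. The sketch below assumes that energy efficiency is defined as throughput divided by power, so that the efficiency gain equals the speedup times the GPU-to-FPGA power ratio. That definition, and the resulting power ratio, are inferences from the reported numbers rather than values stated in this review. The function names are hypothetical.

```python
def tops_per_watt(tops, watts):
    """Energy efficiency as throughput divided by power."""
    return tops / watts

def implied_power_ratio(speedup, efficiency_gain):
    """If efficiency = throughput / power, then
    efficiency_gain = speedup * (P_gpu / P_fpga), so P_gpu / P_fpga = gain / speedup."""
    return efficiency_gain / speedup

fpga_efficiency = tops_per_watt(7.663, 8.2)               # reported TOPS and watts
small_batch_power_ratio = implied_power_ratio(8.3, 75.0)  # reported small-batch ratios

print(f"FPGA efficiency: {fpga_efficiency * 1000:.1f} GOPS/W")
print(f"Implied GPU/FPGA power ratio (small batch): {small_batch_power_ratio:.2f}")
```

Under these assumptions the FPGA delivers a little over 0.93 TOPS/W, and the small-batch figures imply the GPU draws roughly nine times the FPGA's power, which is consistent with the 9.5x efficiency advantage reported when throughput is about equal.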


This review was generated by AI and reviewed by a human editor.