QUICK REVIEW

[Paper Review] Deep Learning on FPGAs: Past, Present, and Future

Griffin Lacey, Graham W. Taylor|arXiv (Cornell University)|Feb 13, 2016

CCD and CMOS Imaging Sensors|References: 29|Cited by: 154

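One-Sentence Summary

This paper reviews deep learning on FPGAs, discusses high-level OpenCL toolchains, and assesses CNN/MLP implementations and design flows, highlighting future directions and the potential of FPGAs for power-constrained acceleration.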

ABSTRACT

The rapid growth of data size and accessibility in recent years has instigated a shift of philosophy in algorithm design for artificial intelligence. Instead of engineering algorithms by hand, the ability to learn composable systems automatically from massive amounts of data has led to ground-breaking performance in important domains such as computer vision, speech recognition, and natural language processing. The most popular class of techniques used in these domains is called deep learning, and is seeing significant attention from industry. However, these models require incredible amounts of data and compute power to train, and are limited by the need for better hardware acceleration to accommodate scaling beyond current data and model sizes. While the current solution has been to use clusters of graphics processing units (GPU) as general purpose processors (GPGPU), the use of field programmable gate arrays (FPGA) provide an interesting alternative. Current trends in design tools for FPGAs have made them more compatible with the high-level software practices typically practiced in the deep learning community, making FPGAs more accessible to those who build and deploy models. Since FPGA architectures are flexible, this could also allow researchers the ability to explore model-level optimizations beyond what is possible on fixed architectures such as GPUs. As well, FPGAs tend to provide high performance per watt of power consumption, which is of particular importance for application scientists interested in large scale server-based deployment or resource-limited embedded applications. This review takes a look at deep learning and FPGAs from a hardware acceleration perspective, identifying trends and innovations that make these technologies a natural fit, and motivates a discussion on how FPGAs may best serve the needs of the deep learning community moving forward.

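Research Motivation and Objectives

Motivate the need for hardware acceleration of deep learning (DL) beyond GPUs.
Characterize the role of FPGAs as flexible, power-efficient accelerators for DL workloads.
Assess current FPGA-based CNN/MLP implementations and their design trade-offs.
Discuss the adoption of high-level abstraction tools and OpenCL to bridge the DL and FPGA communities.
Propose future directions for scaling DL on FPGAs and for improving tools and workflows.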

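Proposed Approach

Review CNN and MLP architectures and their suitability for FPGA acceleration.
Discuss FPGA characteristics, including reconfigurability, memory hierarchy, and pipeline parallelism.
Analyze high-level synthesis and OpenCL as routes to making FPGAs accessible to DL researchers.
Describe an FPGA-centric design flow that integrates DL models with OpenCL-based workflows.
Consider training versus inference on FPGA hardware and the associated performance implications.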
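
To make the pipeline parallelism mentioned above concrete, here is a minimal sketch of an idealized cycle-count model. It is not taken from the paper: the function names are hypothetical, and it assumes every stage takes exactly one cycle, with no stalls and no memory bottlenecks.

```python
def sequential_cycles(num_items: int, num_stages: int) -> int:
    """Cycles when each item passes through all stages before the next item starts."""
    return num_items * num_stages


def pipelined_cycles(num_items: int, num_stages: int) -> int:
    """Cycles when a new item enters the pipeline every cycle (idealized assumption).

    The first item needs num_stages cycles to fill the pipeline; each
    remaining item then completes one cycle after the previous one.
    """
    if num_items == 0:
        return 0
    return num_stages + num_items - 1


if __name__ == "__main__":
    items, stages = 100, 4  # illustrative values, not figures from the paper
    seq = sequential_cycles(items, stages)
    pipe = pipelined_cycles(items, stages)
    print(f"sequential: {seq} cycles, pipelined: {pipe} cycles")
```

Under these assumptions, throughput approaches one result per cycle as the number of items grows, which is why streaming workloads are often described as a good match for pipelined hardware.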

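Experimental Results

Research Questions

RQ1: Compared with general-purpose processors (GPPs) and GPUs, what makes FPGAs an attractive platform for DL acceleration?
RQ2: How do DL models (especially CNNs and MLPs) map onto FPGA architectures, and what performance/power trade-offs result?
RQ3: What developments in tools and design flows are needed to broaden FPGA adoption in the DL community?
RQ4: What are the near-term and long-term directions for scaling DL in multi-FPGA or power-constrained environments?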

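Key Findings

State-of-the-art FPGA-based CNN implementations reach tens to hundreds of images per second within power budgets of tens of watts (for example, 134 images/s at 25 W on ImageNet 1K on a Stratix V platform).
OpenCL enables cross-hardware programming across FPGAs, GPUs, and CPUs; despite platform-specific limitations, it has eased the adoption of DL workflows on FPGAs.
High-level design tools and OpenCL support are widening FPGA accessibility for researchers and DL practitioners, connecting software-like DL workflows to reconfigurable hardware.
FPGAs offer pipeline parallelism and customizable architectures, and may outperform fixed-architecture GPUs in performance per watt on some DL primitives and streaming workloads.
Future directions include larger memories, multi-FPGA configurations, and more abstract design tools that reduce compile-time bottlenecks and speed up iteration.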
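
The performance-per-watt framing can be checked with simple arithmetic. The sketch below converts the one figure quoted above (134 images/s at 25 W) into images per joule and joules per image. The helper names are hypothetical, and the calculation assumes the quoted power is the sustained draw during inference.

```python
def images_per_joule(images_per_second: float, watts: float) -> float:
    """Throughput per watt; since 1 W = 1 J/s, this equals images per joule."""
    return images_per_second / watts


def joules_per_image(images_per_second: float, watts: float) -> float:
    """Energy spent per processed image, assuming constant power draw."""
    return watts / images_per_second


if __name__ == "__main__":
    throughput, power = 134.0, 25.0  # the Stratix V example quoted above
    print(f"{images_per_joule(throughput, power):.2f} images/J")  # 5.36 images/J
    print(f"{joules_per_image(throughput, power):.3f} J/image")   # 0.187 J/image
```

The same two helpers can be applied to any other platform's reported throughput and power to compare efficiency, provided both numbers were measured under comparable conditions.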

This review was generated by AI and reviewed by human editors.