QUICK REVIEW

[Paper Review] SC-DCNN: Highly-Scalable Deep Convolutional Neural Network using Stochastic Computing

Ao Ren, Ji Li|arXiv (Cornell University)|Nov 18, 2016

Error Correcting Code Techniques | References: 35 | Cited by: 35

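One-Sentence Summary

This paper presents SC-DCNN, the first comprehensive framework for implementing deep convolutional neural networks (DCNNs) with stochastic computing (SC), which exploits the fact that SC performs multiplication and addition with simple logic (AND gates and multiplexers) to design optimized function blocks for inner product, pooling, and activation, jointly optimize the feature extraction blocks, and store weights efficiently; the resulting LeNet5 implementation occupies only 17 mm² and consumes 1.53 W while reaching a throughput of 781,250 images/s and an energy efficiency of 510,734 images/J.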

ABSTRACT

With recent advancing of Internet of Things (IoTs), it becomes very attractive to implement the deep convolutional neural networks (DCNNs) onto embedded/portable systems. Presently, executing the software-based DCNNs requires high-performance server clusters in practice, restricting their widespread deployment on the mobile devices. To overcome this issue, considerable research efforts have been conducted in the context of developing highly-parallel and specific DCNN hardware, utilizing GPGPUs, FPGAs, and ASICs. Stochastic Computing (SC), which uses bit-stream to represent a number within [-1, 1] by counting the number of ones in the bit-stream, has a high potential for implementing DCNNs with high scalability and ultra-low hardware footprint. Since multiplications and additions can be calculated using AND gates and multiplexers in SC, significant reductions in power/energy and hardware footprint can be achieved compared to the conventional binary arithmetic implementations. The tremendous savings in power (energy) and hardware resources bring about immense design space for enhancing scalability and robustness for hardware DCNNs. This paper presents the first comprehensive design and optimization framework of SC-based DCNNs (SC-DCNNs). We first present the optimal designs of function blocks that perform the basic operations, i.e., inner product, pooling, and activation function. Then we propose the optimal design of four types of combinations of basic function blocks, named feature extraction blocks, which are in charge of extracting features from input feature maps. Besides, weight storage methods are investigated to reduce the area and power/energy consumption for storing weights. Finally, the whole SC-DCNN implementation is optimized, with feature extraction blocks carefully selected, to minimize area and power/energy consumption while maintaining a high network accuracy level.
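
To make the bit-stream arithmetic described in the abstract concrete, here is a minimal Python sketch; it is a software illustration, not the paper's hardware design. It assumes the standard bipolar SC encoding, in which a value x in [-1, 1] maps to a stream whose probability of a 1 is (x + 1) / 2. Under that encoding multiplication is normally done with an XNOR gate, while the AND gate named in the abstract is the multiplier for the unipolar encoding of values in [0, 1], so both are shown. All function names, the stream length, and the seed are illustrative assumptions.

```python
import random

def encode_bipolar(x, length, rng):
    """Encode x in [-1, 1] as a bit-stream with P(bit = 1) = (x + 1) / 2."""
    p = (x + 1.0) / 2.0
    return [1 if rng.random() < p else 0 for _ in range(length)]

def decode_bipolar(bits):
    """Recover the value by counting ones: x = 2 * (ones / length) - 1."""
    return 2.0 * sum(bits) / len(bits) - 1.0

def encode_unipolar(p, length, rng):
    """Encode p in [0, 1] as a bit-stream with P(bit = 1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def decode_unipolar(bits):
    """Recover the value as the fraction of ones."""
    return sum(bits) / len(bits)

def multiply_bipolar(a_bits, b_bits):
    """Bipolar multiplication: one XNOR gate applied bit by bit."""
    return [1 - (a ^ b) for a, b in zip(a_bits, b_bits)]

def multiply_unipolar(a_bits, b_bits):
    """Unipolar multiplication: one AND gate applied bit by bit."""
    return [a & b for a, b in zip(a_bits, b_bits)]

def scaled_add(a_bits, b_bits, rng):
    """MUX addition: a random select bit forwards one input per cycle,
    so the output stream encodes (a + b) / 2 rather than a + b."""
    return [a if rng.random() < 0.5 else b for a, b in zip(a_bits, b_bits)]

rng = random.Random(0)   # fixed seed (an assumption) so the run is repeatable
N = 8192                 # stream length, chosen only for illustration

a = encode_bipolar(0.5, N, rng)
b = encode_bipolar(-0.4, N, rng)
product = decode_bipolar(multiply_bipolar(a, b))    # estimates 0.5 * -0.4 = -0.2
half_sum = decode_bipolar(scaled_add(a, b, rng))    # estimates (0.5 - 0.4) / 2 = 0.05

u = encode_unipolar(0.6, N, rng)
v = encode_unipolar(0.5, N, rng)
unipolar_product = decode_unipolar(multiply_unipolar(u, v))  # estimates 0.6 * 0.5 = 0.3

print(product, half_sum, unipolar_product)
```

The decoded values are estimates, not exact results: the random error shrinks roughly with the square root of the stream length, so higher precision costs more clock cycles.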

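Motivation and Objectives

Address the high power consumption and hardware cost of deploying software-based DCNNs on embedded and mobile IoT devices.
Explore stochastic computing (SC) as a new paradigm for achieving ultra-low hardware footprint and high energy efficiency in DCNN accelerators.
Design a comprehensive, bottom-up framework for SC-based DCNNs that optimizes function blocks, feature extraction blocks, and weight storage to minimize area and power.
Overcome the inherent inaccuracy of stochastic arithmetic through joint optimization and error compensation while maintaining high network accuracy.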

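Proposed Method

Represent real numbers in [-1, 1] as bit-streams using stochastic computing, so that arithmetic reduces to low-complexity logic built from AND gates and multiplexers (MUXes).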
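Design dedicated hardware blocks for the core DCNN operations in the SC domain: inner product (dot product), max pooling, average pooling, and activation functions (such as the hyperbolic tangent).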
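Propose four jointly optimized feature extraction block architectures that combine the basic function blocks, with configurations tailored to input bit-stream length and logic compatibility.
Propose three weight storage optimizations (bit slicing, weight grouping, and bit-inversion encoding) to minimize SRAM area and power.
Follow a bottom-up design methodology that explores the design space of function block types, pooling methods, and configuration parameters.
Evaluate the framework experimentally on LeNet5, reporting performance, area, power, and energy efficiency across multiple configurations.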
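
The inner-product block in the list above can be sketched in software to show why the choice of adder matters. The Python sketch below compares two ways of summing XNOR product streams: an n-to-1 MUX that forwards one product bit per cycle (so its output encodes the sum divided by n), and a parallel counter that counts the ones across all n product bits each cycle. The counter here is exact, which is a simplifying assumption; a hardware design may use an approximate parallel counter (APC) to save gates. Function names, input sizes, stream length, and the seed are all hypothetical choices for illustration.

```python
import random

def encode(x, length, rng):
    """Bipolar encoding: P(bit = 1) = (x + 1) / 2 for x in [-1, 1]."""
    p = (x + 1.0) / 2.0
    return [1 if rng.random() < p else 0 for _ in range(length)]

def xnor_products(xs, ws, length, rng):
    """Build one XNOR product stream per (input, weight) pair."""
    streams = []
    for x, w in zip(xs, ws):
        xb, wb = encode(x, length, rng), encode(w, length, rng)
        streams.append([1 - (a ^ b) for a, b in zip(xb, wb)])
    return streams

def mux_inner_product(streams, rng):
    """n-to-1 MUX adder: each cycle forwards the bit of one randomly
    selected product stream, so the output stream encodes sum / n."""
    n, length = len(streams), len(streams[0])
    out = [streams[rng.randrange(n)][t] for t in range(length)]
    scaled = 2.0 * sum(out) / length - 1.0
    return n * scaled  # undo the 1/n scaling to compare with the exact sum

def counter_inner_product(streams):
    """Parallel-counter adder (exact count assumed): accumulate the number
    of ones over all n streams and all cycles, then convert to bipolar."""
    n, length = len(streams), len(streams[0])
    ones = sum(sum(s[t] for s in streams) for t in range(length))
    return 2.0 * ones / length - n

def rms_errors(trials=20, n=16, length=512, seed=1):
    """Root-mean-square error of both adders over random inputs and weights."""
    rng = random.Random(seed)
    mux_sq = counter_sq = 0.0
    for _ in range(trials):
        xs = [rng.uniform(-1, 1) for _ in range(n)]
        ws = [rng.uniform(-1, 1) for _ in range(n)]
        exact = sum(x * w for x, w in zip(xs, ws))
        streams = xnor_products(xs, ws, length, rng)
        mux_sq += (mux_inner_product(streams, rng) - exact) ** 2
        counter_sq += (counter_inner_product(streams) - exact) ** 2
    return (mux_sq / trials) ** 0.5, (counter_sq / trials) ** 0.5

mux_rms, counter_rms = rms_errors()
print(f"MUX adder RMS error:     {mux_rms:.3f}")
print(f"Counter adder RMS error: {counter_rms:.3f}")
```

The MUX looks at only one of the n product bits per cycle and its result must be scaled back up by n, which amplifies the sampling noise; the counter uses every product bit. That is the software analogue of a cheaper but less accurate adder versus a larger but more accurate one.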

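Experimental Results

Research Questions

RQ1: Can stochastic computing be applied effectively to implement DCNNs while significantly reducing hardware area and energy consumption?
RQ2: How can the core DCNN operations (inner product, pooling, and activation) be implemented efficiently and accurately in the stochastic computing domain?
RQ3: In SC-based DCNNs, which feature extraction block configurations strike the best balance among accuracy, area, and power efficiency?
RQ4: How do different weight storage schemes affect the area and energy efficiency of SC-DCNN?
RQ5: To what extent can SC-DCNN maintain high network accuracy despite the inherent approximation of stochastic arithmetic?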

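Key Findings

The SC-DCNN framework yields a LeNet5 implementation that occupies only 17 mm² and consumes 1.53 W, enabling ultra-compact deployment in embedded systems.
The proposed SC-DCNN reaches a throughput of 781,250 images/s, demonstrating strong computational efficiency for real-time inference.
The framework achieves an area efficiency of 45,946 images/s/mm² and an energy efficiency of 510,734 images/J, significantly outperforming conventional GPU and CPU platforms.
Compared with the Nvidia Tesla C2075, SC-DCNN (No. 11) achieves 15,625× higher throughput and 159,604× higher energy efficiency.
Configurations using MUX-based feature extraction blocks achieve the lowest area and power, whereas APC-based blocks achieve higher accuracy, allowing a trade-off according to application requirements.
The results indicate that the inaccuracies introduced by stochastic arithmetic can compensate for one another across network layers, suggesting potential for further efficiency gains in larger networks.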
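
The headline efficiency figures above are related by simple division, which the short Python check below works through. It assumes area efficiency is throughput divided by area and energy efficiency is throughput divided by power; the variable names are illustrative. The recomputed values differ slightly from the reported ones, presumably because the area and power figures quoted here are rounded.

```python
# Figures quoted in the findings above.
throughput = 781_250            # images/s
area_mm2 = 17.0                 # mm^2 (rounded, as quoted)
power_w = 1.53                  # W (rounded, as quoted)
reported_area_eff = 45_946      # images/s/mm^2
reported_energy_eff = 510_734   # images/J

# Assumed definitions: efficiency = throughput / resource.
area_eff = throughput / area_mm2     # images/s/mm^2
energy_eff = throughput / power_w    # (images/s) / (J/s) = images/J

# Relative gap between recomputed and reported values.
area_gap = abs(area_eff - reported_area_eff) / reported_area_eff
energy_gap = abs(energy_eff - reported_energy_eff) / reported_energy_eff

# Average interval between completed images (not a latency figure).
seconds_per_image = 1.0 / throughput

print(f"area efficiency:   {area_eff:,.0f} images/s/mm^2 (gap {area_gap:.3%})")
print(f"energy efficiency: {energy_eff:,.0f} images/J (gap {energy_gap:.3%})")
print(f"average time per image: {seconds_per_image * 1e6:.2f} microseconds")
```

Both recomputed values land within a small fraction of a percent of the reported ones, so the four headline numbers are mutually consistent.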


This review was generated by AI and reviewed by human editors.