QUICK REVIEW

[Paper Review] SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks

Angshuman Parashar, Minsoo Rhu|arXiv (Cornell University)|May 23, 2017

Advanced Neural Network Applications | References: 17 | Cited by: 124

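One-Sentence Summary

SCNN is a CNN inference accelerator that keeps weights and activations in a compressed-sparse encoding and processes them with the PT-IS-CP-sparse dataflow, achieving significant performance and energy gains over dense accelerators. It deploys 64 processing elements (PEs) with 1,024 multipliers in total and emphasizes on-chip data reuse and sparse computation.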

ABSTRACT

Convolutional Neural Networks (CNNs) have emerged as a fundamental technology for machine learning. High performance and extreme energy efficiency are critical for deployments of CNNs in a wide range of situations, especially mobile platforms such as autonomous vehicles, cameras, and electronic personal assistants. This paper introduces the Sparse CNN (SCNN) accelerator architecture, which improves performance and energy efficiency by exploiting the zero-valued weights that stem from network pruning during training and zero-valued activations that arise from the common ReLU operator applied during inference. Specifically, SCNN employs a novel dataflow that enables maintaining the sparse weights and activations in a compressed encoding, which eliminates unnecessary data transfers and reduces storage requirements. Furthermore, the SCNN dataflow facilitates efficient delivery of those weights and activations to the multiplier array, where they are extensively reused. In addition, the accumulation of multiplication products are performed in a novel accumulator array. Our results show that on contemporary neural networks, SCNN can improve both performance and energy by a factor of 2.7x and 2.3x, respectively, over a comparably provisioned dense CNN accelerator.

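Motivation and Goals

Enable efficient CNN inference by exploiting the sparsity that comes from weight pruning and from zero-valued activations.
Develop a dataflow and hardware architecture that keep sparse data compressed on chip and reuse it there.
Minimize data movement and avoid wasted work by skipping multiplications that have a zero-valued operand.
Evaluate the sparse CNN accelerator against dense counterparts in terms of throughput, energy, and area.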

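Proposed Method

Introduce the PT-IS-CP-sparse dataflow, which operates on compressed-sparse blocks of weights and activations.
Use a Cartesian-product multiplier array to compute all pairwise products of nonzero weights and nonzero activations.
Use a scatter-accumulate network to sum each partial product at its correct output coordinate.
Build an on-chip memory hierarchy, comprising IARAM/OARAM and a distributed array of accumulator banks, so that data stays local.
Represent outputs in compressed-sparse form, and apply halo handling and layer sequencing to manage tiling across processing elements.
Provide cycle-level simulation and analytical modeling (TimeLoop) to compare dense and sparse architectures.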
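The Cartesian-product and scatter-accumulate steps above can be sketched in a few lines of Python. This is a minimal illustration for a single input/output channel pair with stride 1 and no padding, not the accelerator's actual implementation. The `(value, row, col)` coordinate format and the function names `compress`, `sparse_conv2d`, and `dense_conv2d` are assumptions made for the example; the paper's hardware encoding and tiling are more involved.

```python
from itertools import product

def compress(dense_2d):
    """Keep only nonzero entries as (value, row, col) tuples (illustrative format)."""
    return [(v, r, c)
            for r, row in enumerate(dense_2d)
            for c, v in enumerate(row) if v != 0]

def sparse_conv2d(acts, weights, in_h, in_w, k_h, k_w):
    """Stride-1, unpadded 2-D convolution (CNN-style correlation) on compressed operands."""
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    acc = [[0] * out_w for _ in range(out_h)]  # stands in for the accumulator banks
    # Cartesian product: every nonzero activation meets every nonzero weight.
    for (a, y, x), (w, r, s) in product(acts, weights):
        oy, ox = y - r, x - s  # output coordinate derived from the operand coordinates
        # Products that fall outside this output tile are dropped here;
        # in a tiled design they would belong to a neighbor's halo region.
        if 0 <= oy < out_h and 0 <= ox < out_w:
            acc[oy][ox] += a * w  # scatter-accumulate
    return acc

def dense_conv2d(inp, ker):
    """Dense reference that multiplies every operand pair, zeros included."""
    k_h, k_w = len(ker), len(ker[0])
    out_h, out_w = len(inp) - k_h + 1, len(inp[0]) - k_w + 1
    return [[sum(inp[oy + r][ox + s] * ker[r][s]
                 for r in range(k_h) for s in range(k_w))
             for ox in range(out_w)]
            for oy in range(out_h)]

inp = [[1, 0, 2, 0],
       [0, 0, 0, 3],
       [4, 0, 0, 0],
       [0, 5, 0, 0]]
ker = [[1, 0],
       [0, -1]]

acts, weights = compress(inp), compress(ker)
sparse_out = sparse_conv2d(acts, weights, 4, 4, 2, 2)
dense_out = dense_conv2d(inp, ker)
sparse_mults = len(acts) * len(weights)  # 5 nonzero activations x 2 nonzero weights
dense_mults = 3 * 3 * 2 * 2              # 9 outputs x 4 kernel taps
```

In this toy case the sparse path issues 10 multiplications where the dense reference issues 36, and both produce the same output. The saving grows with the fraction of zeros in either operand.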

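Experimental Results

Research Questions

RQ1: How does exploiting weight and activation sparsity on a dedicated accelerator affect CNN inference performance and energy consumption?
RQ2: Which dataflow and hardware design makes the most effective use of compressed-sparse representations of weights, inputs, and outputs?
RQ3: With comparable resources, how does a sparse CNN accelerator trade off area, speed, and energy against a dense design?
RQ4: With compressed representations, can all activations of common networks (such as AlexNet and GoogLeNet) fit on chip?
RQ5: How do halo and tiling strategies affect the scalability and energy efficiency of sparse CNN acceleration?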

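Key Findings

A 64-PE SCNN configuration with 1,024 multipliers reaches a peak throughput of about 2 Tera-ops.
Relative to a comparably provisioned dense CNN accelerator, SCNN improves performance by about 2.7x and energy by about 2.3x.
SCNN exploits both activation and weight sparsity through compressed-sparse encoding and the PT-IS-CP-sparse dataflow, which eliminates unnecessary multiplications.
The SCNN design uses 1 MB of on-chip activation RAM (IARAM+OARAM) and accumulates partial sums in a distributed array of accumulator banks to support sparse computation.
A single SCNN PE occupies about 0.123 mm^2, and the full 64-PE accelerator is estimated at 7.9 mm^2, driven mainly by storage requirements.
The evaluation uses TimeLoop analysis and a cycle-level simulator, which model tiling and DRAM access energy to estimate performance and power.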
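The headline figures above can be cross-checked with simple arithmetic. The sketch below is only a sanity check: the 1 GHz clock and the convention of counting one multiply-accumulate as two operations are assumptions that this review does not state, chosen because they reproduce the quoted 2 Tera-ops figure.

```python
NUM_PES = 64
TOTAL_MULTIPLIERS = 1024
PE_AREA_MM2 = 0.123

ASSUMED_CLOCK_HZ = 1e9  # assumption: clock rate is not given in this review
OPS_PER_MAC = 2         # assumption: multiply and add counted as separate ops

# Multipliers are split evenly across the PEs.
multipliers_per_pe = TOTAL_MULTIPLIERS // NUM_PES

# Peak throughput if every multiplier is busy on every cycle.
peak_ops_per_s = TOTAL_MULTIPLIERS * OPS_PER_MAC * ASSUMED_CLOCK_HZ

# Summed PE area, to compare against the quoted 7.9 mm^2 total.
total_pe_area_mm2 = NUM_PES * PE_AREA_MM2

print(f"{multipliers_per_pe} multipliers per PE")
print(f"{peak_ops_per_s / 1e12:.3f} Tera-ops peak")
print(f"{total_pe_area_mm2:.3f} mm^2 across all PEs")
```

Under these assumptions each PE holds 16 multipliers, the peak works out to roughly 2 Tera-ops, and 64 PEs at 0.123 mm^2 each sum to about 7.9 mm^2, which matches the figures quoted above.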

This review was generated by AI and checked by a human editor.