QUICK REVIEW

[Paper Review] Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks

Hardik Sharma, Jongse Park|arXiv (Cornell University)|Dec 5, 2017

Advanced Neural Network Applications|References 40|Cited by 32

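One-Sentence Summary

Bit Fusion is a DNN accelerator that composes dynamically at bit granularity: its processing elements fuse at the bit level to match the varying bitwidth of each DNN layer, which substantially reduces computation and memory traffic with no loss in accuracy. At 45 nm, in the same area and at the same frequency, it delivers a 3.9x speedup and 5.1x energy savings over Eyeriss; scaled to 16 nm, it almost matches the performance of a 250 W Titan Xp GPU while consuming only 895 mW.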

ABSTRACT

Fully realizing the potential of acceleration for Deep Neural Networks (DNNs) requires understanding and leveraging algorithmic properties. This paper builds upon the algorithmic insight that bitwidth of operations in DNNs can be reduced without compromising their classification accuracy. However, to prevent accuracy loss, the bitwidth varies significantly across DNNs and it may even be adjusted for each layer. Thus, a fixed-bitwidth accelerator would either offer limited benefits to accommodate the worst-case bitwidth requirements, or lead to a degradation in final accuracy. To alleviate these deficiencies, this work introduces dynamic bit-level fusion/decomposition as a new dimension in the design of DNN accelerators. We explore this dimension by designing Bit Fusion, a bit-flexible accelerator, that constitutes an array of bit-level processing elements that dynamically fuse to match the bitwidth of individual DNN layers. This flexibility in the architecture enables minimizing the computation and the communication at the finest granularity possible with no loss in accuracy. We evaluate the benefits of BitFusion using eight real-world feed-forward and recurrent DNNs. The proposed microarchitecture is implemented in Verilog and synthesized in 45 nm technology. Using the synthesis results and cycle accurate simulation, we compare the benefits of Bit Fusion to two state-of-the-art DNN accelerators, Eyeriss and Stripes. In the same area, frequency, and process technology, BitFusion offers 3.9x speedup and 5.1x energy savings over Eyeriss. Compared to Stripes, BitFusion provides 2.6x speedup and 3.9x energy reduction at 45 nm node when BitFusion area and frequency are set to those of Stripes. Scaling to GPU technology node of 16 nm, BitFusion almost matches the performance of a 250-Watt Titan Xp, which uses 8-bit vector instructions, while BitFusion merely consumes 895 milliwatts of power.

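Motivation and Objectives

Address the inefficiency of fixed-bitwidth DNN accelerators on variable-bitwidth operations, which otherwise forces a choice between underused hardware and degraded accuracy.
Build on the algorithmic insight that DNNs can keep their accuracy at lower bitwidths, which may differ for each layer, enabling fine-grained optimization.
Design a hardware architecture that supports dynamic bit-level fusion and decomposition at run time to match the bitwidth of each DNN layer.
Minimize computation and memory-access energy by storing and processing data at the lowest bitwidth each layer requires.
Show that bit-level flexibility yields significant performance and energy-efficiency gains across diverse DNN workloads, including CNNs and RNNs.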

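Proposed Method

Design a bit-flexible accelerator built from an array of bit-level processing elements that dynamically fuse or decompose to fit the bitwidth of each layer's operations.
Implement a custom instruction set architecture (Fusion-ISA) with loop instructions and iteration semantics to shrink the instruction footprint and support bit-level control.
Integrate encoding and memory-access logic that stores and retrieves data at the minimum required bitwidth, reducing off-chip and on-chip memory traffic.
Evaluate performance, area, and power on eight real-world DNNs using cycle-accurate simulation and Verilog synthesis at 45 nm.
Compare Bit Fusion with Eyeriss and Stripes under matched area, frequency, and process technology to isolate the benefit of bit-level composability.
Scale the design to the 16 nm process node and compare its power and performance against a high-end GPU (Titan Xp).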
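
The bit-level fusion described in the list above can be illustrated in software. The following is a minimal Python sketch, not the paper's hardware design: it assumes unsigned operands and a 2-bit building block (`brick_bits=2`, a granularity chosen here for illustration), and the names `split_into_bricks` and `fused_multiply` are hypothetical. It shows that a wide multiplication equals the shift-and-add of narrow partial products, the arithmetic identity that lets narrow multipliers be composed into wider ones.

```python
def split_into_bricks(value: int, bits: int, brick_bits: int = 2) -> list[int]:
    """Split an unsigned `bits`-wide value into little-endian chunks of `brick_bits` bits.

    Hypothetical helper for illustration; the 2-bit default is an assumption.
    """
    if bits <= 0 or bits % brick_bits != 0:
        raise ValueError("bits must be a positive multiple of brick_bits")
    if not 0 <= value < (1 << bits):
        raise ValueError("value does not fit in the given bitwidth")
    mask = (1 << brick_bits) - 1
    return [(value >> shift) & mask for shift in range(0, bits, brick_bits)]


def fused_multiply(a: int, a_bits: int, w: int, w_bits: int, brick_bits: int = 2) -> int:
    """Multiply two unsigned operands using only narrow chunk-by-chunk products.

    Each chunk pair contributes one narrow multiply; its result is shifted by
    the combined position of the two chunks and accumulated. Signed operands
    are not handled in this sketch.
    """
    a_chunks = split_into_bricks(a, a_bits, brick_bits)
    w_chunks = split_into_bricks(w, w_bits, brick_bits)
    total = 0
    for i, a_chunk in enumerate(a_chunks):
        for j, w_chunk in enumerate(w_chunks):
            partial = a_chunk * w_chunk              # one narrow multiply
            total += partial << (brick_bits * (i + j))  # shift-add to fuse
    return total


if __name__ == "__main__":
    # An 8-bit input times a 4-bit weight, built from 2-bit partial products.
    print(fused_multiply(200, 8, 13, 4) == 200 * 13)
```

Under these assumptions, an 8-bit by 8-bit product needs 16 narrow multiplies and a 2-bit by 2-bit product needs one, so the same pool of narrow units could serve more low-bitwidth multiplications in parallel.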

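Experimental Results

Research Questions

RQ1: Can dynamic bit-level fusion in a DNN accelerator significantly reduce computation and memory traffic without losing classification accuracy?
RQ2: Compared with fixed-bitwidth or binarization-only accelerators, how does bit-level composability fare in performance and energy efficiency?
RQ3: To what extent can the variation in bitwidth across DNN layers be exploited to minimize hardware resource usage and data movement?
RQ4: How does bit-level fusion affect performance and energy efficiency at an advanced process node such as 16 nm?
RQ5: Can a bit-flexible accelerator deliver performance comparable to a high-power GPU while staying at very low power?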

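Key Findings

At 45 nm, under the same area, frequency, and process technology, Bit Fusion achieves a 3.9x speedup and 5.1x energy savings over Eyeriss.
Against Stripes at the 45 nm node, with area and frequency matched to those of Stripes, Bit Fusion achieves a 2.6x speedup and a 3.9x energy reduction.
Scaled to the 16 nm node, Bit Fusion almost matches the performance of a 250 W Titan Xp GPU while consuming only 895 mW.
Because multiply-add operations account for more than 99% of all operations in DNNs, the bit-level computation falls almost quadratically as bitwidth is reduced.
Storing and accessing data at the minimum required bitwidth lowers memory-access energy proportionally and effectively increases on-chip storage capacity.
Fusion-ISA enables efficient software-controlled bit-level fusion, reduces the instruction footprint, and maximizes parallelism and data locality.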
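
The scaling behind the findings on computation and memory can be checked with simple arithmetic. The sketch below is an illustrative cost model, not the paper's evaluation methodology: it assumes each multiplication is built from 2-bit by 2-bit partial products and that values are packed back to back in memory. The helper names `brick_ops_per_mac` and `packed_bytes` are hypothetical.

```python
def ceil_div(numerator: int, denominator: int) -> int:
    """Integer ceiling division for non-negative operands."""
    return -(-numerator // denominator)


def brick_ops_per_mac(a_bits: int, w_bits: int, brick_bits: int = 2) -> int:
    """Narrow partial products needed for one a_bits x w_bits multiplication.

    Assumes a 2-bit building block by default (an illustrative choice).
    """
    return ceil_div(a_bits, brick_bits) * ceil_div(w_bits, brick_bits)


def packed_bytes(num_values: int, bits: int) -> int:
    """Bytes needed to store num_values values packed at `bits` bits each."""
    return ceil_div(num_values * bits, 8)


if __name__ == "__main__":
    # Halving both operand widths quarters the partial-product count
    # (quadratic), while storage only halves (proportional).
    for bits in (8, 4, 2):
        print(bits, brick_ops_per_mac(bits, bits), packed_bytes(1000, bits))
```

In this model, going from 8-bit to 4-bit operands cuts the partial products per multiplication from 16 to 4, while the storage for the same number of values drops from 1000 to 500 bytes, so the same buffer holds twice as many values.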

This review was generated by AI and reviewed by a human editor.