QUICK REVIEW

[Paper Review] SoftSNN: Low-Cost Fault Tolerance for Spiking Neural Network Accelerators under Soft Errors

Rachmad Vidya Wicaksana Putra, Muhammad Abdullah Hanif | arXiv (Cornell University) | Mar 10, 2022

Advanced Memory and Neural Computing | Cited by 3

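One-Sentence Summary

SoftSNN proposes a low-overhead fault-tolerance method for spiking neural network (SNN) accelerators that mitigates soft errors in the weight registers and neurons without re-execution. By analyzing SNN behavior under soft errors, applying a Bound-and-Protect technique that bounds the weights and protects the neurons, and integrating lightweight hardware enhancements, SoftSNN keeps accuracy degradation below 3% even at high fault rates, while reducing latency by up to 3x and energy by up to 2.3x compared to re-execution-based methods.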

ABSTRACT

Specialized hardware accelerators have been designed and employed to maximize the performance efficiency of Spiking Neural Networks (SNNs). However, such accelerators are vulnerable to transient faults (i.e., soft errors), which occur due to high-energy particle strikes, and manifest as bit flips at the hardware layer. These errors can change the weight values and neuron operations in the compute engine of SNN accelerators, thereby leading to incorrect outputs and accuracy degradation. However, the impact of soft errors in the compute engine and the respective mitigation techniques have not been thoroughly studied yet for SNNs. A potential solution is employing redundant executions (re-execution) for ensuring correct outputs, but it leads to huge latency and energy overheads. Toward this, we propose SoftSNN, a novel methodology to mitigate soft errors in the weight registers (synapses) and neurons of SNN accelerators without re-execution, thereby maintaining the accuracy with low latency and energy overheads. Our SoftSNN methodology employs the following key steps: (1) analyzing the SNN characteristics under soft errors to identify faulty weights and neuron operations, which are required for recognizing faulty SNN behavior; (2) a Bound-and-Protect technique that leverages this analysis to improve the SNN fault tolerance by bounding the weight values and protecting the neurons from faulty operations; and (3) devising lightweight hardware enhancements for the neural hardware accelerator to efficiently support the proposed technique. The experimental results show that, for a 900-neuron network with even a high fault rate, our SoftSNN maintains the accuracy degradation below 3%, while reducing latency and energy by up to 3x and 2.3x respectively, as compared to the re-execution technique.

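Research Motivation and Objectives

Address the key reliability challenge posed by soft errors in SNN accelerators, which can corrupt weights and neuron operations and thereby degrade accuracy.
Develop a fault-tolerance mechanism that avoids the high latency and energy overheads of redundant re-execution.
Maintain high inference accuracy under soft errors while minimizing hardware and performance overheads, so that the approach suits real-time, energy-constrained applications such as IoT and edge computing.
Design a lightweight, hardware-efficient solution tailored to digital SNN implementations, as distinct from analog fault models or generic error-correcting code and dual modular redundancy (ECC/DMR) approaches.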

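Proposed Method

Analyze SNN characteristics under soft errors to identify the faulty weights and neuron operations that cause incorrect outputs.
Propose a Bound-and-Protect (BnP) technique that, through hardware-aware design, bounds weight values to a safe range and uses hardware mechanisms to protect neurons from faulty operations.
Introduce lightweight hardware enhancements into the SNN compute engine, including hardened components in the synapses and neurons, to correct corrupted bits without re-execution.
Design a fault-tolerant architecture that integrates weight bounding and neuron protection at the register and circuit levels, focusing on the compute engine's local memory and neuron logic.
Optimize the implementation for a digital SNN accelerator with 8-bit precision, 256×256 synapses, and 256 neurons, targeting real-time, energy-efficient deployment.
Validate the method across multiple network sizes and fault rates through fault injection and comparison against a baseline and the re-execution technique.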
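The weight-bounding idea above can be illustrated with a minimal software sketch. It is not the authors' hardware design: it assumes unsigned 8-bit weights, independent per-bit flips at a fixed fault rate, and a bound equal to the largest fault-free weight. The function names inject_bit_flips and bound_weights, the weight range, and the fault rate are all hypothetical choices made for illustration.

```python
import random

def inject_bit_flips(weights, fault_rate, rng, bits=8):
    """Flip each bit of each unsigned weight independently with probability fault_rate."""
    faulty = []
    for w in weights:
        for b in range(bits):
            if rng.random() < fault_rate:
                w ^= 1 << b  # a soft error modeled as a single bit flip
        faulty.append(w)
    return faulty

def bound_weights(weights, w_max, replacement=None):
    """Replace every weight above w_max (assumed policy: clamp to w_max by default)."""
    if replacement is None:
        replacement = w_max
    return [w if w <= w_max else replacement for w in weights]

def total_error(weights, reference):
    """Sum of absolute deviations from the fault-free weights."""
    return sum(abs(a - b) for a, b in zip(weights, reference))

rng = random.Random(0)
# Hypothetical fault-free weights that occupy only part of the 8-bit range.
clean = [rng.randrange(0, 64) for _ in range(256)]
w_max = max(clean)  # assumed bound: largest weight seen without faults

faulty = inject_bit_flips(clean, fault_rate=0.05, rng=rng)
bounded = bound_weights(faulty, w_max)

print("error before bounding:", total_error(faulty, clean))
print("error after bounding: ", total_error(bounded, clean))
```

Bounding cannot repair flips that leave a weight inside the valid range, but it caps the damage from flips in the high-order bits, which is where a single flip shifts the value the most. Passing replacement=0 instead of the default clamp gives a second possible policy; which policy the paper's BnP variants use is not stated in this review.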

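Experimental Results

Research Questions

RQ1: How do soft errors in SNN accelerators, particularly in the weight registers and neuron operations, affect inference accuracy and system reliability?
RQ2: Can soft errors be mitigated without redundant re-execution, thereby reducing the latency and energy overheads of SNN accelerators?
RQ3: Which key SNN characteristics under soft errors can be exploited to design a lightweight, efficient fault-tolerance mechanism?
RQ4: How can weight bounding and neuron protection be implemented efficiently in hardware while keeping area and performance overheads minimal?
RQ5: To what extent does the proposed Bound-and-Protect technique preserve accuracy at high fault rates, and what advantages does it offer over re-execution-based mitigation?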

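Key Findings

For a 900-neuron SNN, SoftSNN keeps accuracy degradation below 3% even at high fault rates, significantly outperforming the baseline and re-execution methods.
Compared to re-execution-based mitigation, SoftSNN reduces latency by up to 3x and energy by up to 2.3x.
The BnP-enhanced compute engine incurs an area overhead of only 14% (BnP1) and 18% (BnP2/BnP3), a low cost for the reliability gained.
The proposed hardware enhancements consume less than 1.6x the energy of the baseline SNN, whereas re-execution incurs an energy overhead of up to 3x.
The method is effective across multiple network sizes (N400 to N3600) and workloads (MNIST and Fashion MNIST), showing consistent performance and accuracy improvements.
The results show that reliable SNN execution is achievable without redundant re-execution and at very low performance and energy cost, which suits real-time and energy-constrained applications.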

This review was generated by AI and reviewed by human editors.