QUICK REVIEW

[Paper Review] Spike-driven Transformer

Man Yao, Jiakui Hu | arXiv (Cornell University) | Jul 4, 2023

Advanced Memory and Neural Computing | 35 citations

TL;DR

Introduces a Spike-driven Self-Attention module and spike-focused residuals to convert Transformer operations into sparse additions, achieving energy-efficient, linear-complexity self-attention with competitive accuracy on ImageNet and neuromorphic datasets.

ABSTRACT

Spiking Neural Networks (SNNs) provide an energy-efficient deep learning option due to their unique spike-based event-driven (i.e., spike-driven) paradigm. In this paper, we incorporate the spike-driven paradigm into Transformer by the proposed Spike-driven Transformer with four unique properties: 1) Event-driven, no calculation is triggered when the input of Transformer is zero; 2) Binary spike communication, all matrix multiplications associated with the spike matrix can be transformed into sparse additions; 3) Self-attention with linear complexity at both token and channel dimensions; 4) The operations between spike-form Query, Key, and Value are mask and addition. Together, there are only sparse addition operations in the Spike-driven Transformer. To this end, we design a novel Spike-Driven Self-Attention (SDSA), which exploits only mask and addition operations without any multiplication, and thus having up to $87.2\times$ lower computation energy than vanilla self-attention. Especially in SDSA, the matrix multiplication between Query, Key, and Value is designed as the mask operation. In addition, we rearrange all residual connections in the vanilla Transformer before the activation functions to ensure that all neurons transmit binary spike signals. It is shown that the Spike-driven Transformer can achieve 77.1\% top-1 accuracy on ImageNet-1K, which is the state-of-the-art result in the SNN field. The source code is available at https://github.com/BICLab/Spike-Driven-Transformer.

Motivation & Objective

Pursue energy-efficient deep learning by combining Spiking Neural Networks (SNNs) with Transformer architectures.
Design a fully spike-driven Transformer where key operations are performed via sparse additions and binary spikes.
Rearrange residual connections to ensure binary spike communication throughout the network.
Demonstrate the energy efficiency and competitive accuracy of the proposed model on static and neuromorphic datasets.

Proposed method

Develop Spike-driven Self-Attention (SDSA) that uses only mask and sparse addition, avoiding multiplications and softmax.
Replace Q, K, V multiplications with Hadamard masks and column-wise summations followed by a spike neuron layer, achieving linear complexity in tokens and channels.
Rearrange residual connections to propagate binary spike signals and avoid multi-bit spike outputs.
Process image inputs through Spiking Patch Splitting, SDSA, MLP, and a linear classifier with a spike-enabled pipeline.
Provide theoretical energy analysis showing large energy savings for self-attention and overall spike-driven components.
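The mask-and-add idea behind SDSA can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single time step, binary integer Q, K, V matrices of shape (tokens, channels), a plain threshold in place of a full spiking neuron with membrane dynamics, and summation along the token axis as the "column-wise" sum. The names `spike_neuron`, `sdsa_sketch`, and `threshold` are hypothetical.

```python
import numpy as np

def spike_neuron(x, threshold=1.0):
    # Simplified stand-in for a spiking neuron layer (assumption):
    # emit 1 where the accumulated input reaches the threshold, else 0.
    return (x >= threshold).astype(np.int64)

def sdsa_sketch(q, k, v, threshold=1.0):
    # q, k, v: binary integer arrays of shape (N tokens, D channels).
    # The Hadamard product of two binary matrices is a bitwise AND (a mask).
    qk = q & k
    # Column-wise summation: additions only, one count per channel.
    col_sum = qk.sum(axis=0)
    # Thresholding turns the counts back into a binary gate of shape (D,).
    gate = spike_neuron(col_sum, threshold)
    # Masking V with the gate is again a bitwise AND, broadcast over tokens.
    return v & gate

# Toy example with N=3 tokens and D=2 channels.
q = np.array([[1, 0], [1, 1], [1, 0]])
k = np.array([[1, 1], [1, 0], [0, 1]])
v = np.array([[1, 1], [0, 1], [1, 0]])
out = sdsa_sketch(q, k, v, threshold=2.0)
print(out)
```

Every step touches each of the N x D entries a constant number of times, so the cost of this sketch grows linearly in both tokens and channels, and no N x N attention map is ever formed.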

Experimental results

Research questions

RQ1: Can Spike-driven Self-Attention (SDSA) replace traditional self-attention without sacrificing accuracy?
RQ2: What are the energy and computation benefits of a fully spike-driven Transformer compared to vanilla Transformer and existing spiking Transformers?
RQ3: How do spike-driven residual connections affect network dynamics and task performance?
RQ4: What is the performance of the Spike-driven Transformer on ImageNet and neuromorphic datasets compared to state-of-the-art SNNs?
RQ5: Is the SDSA approach scalable in terms of token and channel dimensions?

Key findings

Spike-driven Transformer achieves 77.1% top-1 accuracy on ImageNet-1K with 288×288 input, D=768, and L=8, which the paper reports as state-of-the-art in the SNN field.
SDSA reduces self-attention energy by up to 87.2x compared to vanilla self-attention by replacing multiplications and softmax with mask and addition operations.
Energy analysis shows that spike-driven self-attention consumes far less energy than ANN self-attention across model sizes (e.g., the 8-768 configuration, where the gap is 87.2x).
Residual connections redesigned as membrane potential shortcuts keep spike signals binary and improve performance versus SEW-based shortcuts.
The approach yields state-of-the-art or competitive results on static and neuromorphic datasets, including CIFAR-10/100, CIFAR10-DVS, and DVS128 Gesture.
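The difference between the two shortcut styles mentioned in the findings can be shown with a small NumPy sketch. It is a simplification under stated assumptions, not the paper's code: a single time step, a threshold function standing in for the spiking neuron, and a SEW-style shortcut modeled as adding two spike tensors after firing. The function names and the toy values are hypothetical.

```python
import numpy as np

def heaviside(u, threshold=1.0):
    # Simplified spiking neuron (assumption): fire when potential >= threshold.
    return (u >= threshold).astype(np.int64)

def sew_style_shortcut(u_branch, spikes_in, threshold=1.0):
    # Fire first, then add the incoming spikes to the branch spikes.
    # The sum of two binary tensors can contain the value 2 (multi-bit).
    return heaviside(u_branch, threshold) + spikes_in

def membrane_shortcut(u_branch, u_in, threshold=1.0):
    # Add membrane potentials first, then fire once.
    # The output is binary by construction.
    return heaviside(u_branch + u_in, threshold)

u_branch = np.array([1.2, 0.3, 1.5])   # branch membrane potentials (toy values)
spikes_in = np.array([1, 0, 0])        # incoming spikes on the shortcut path
u_in = np.array([0.9, 0.8, -1.0])      # incoming potentials on the shortcut path

sew_out = sew_style_shortcut(u_branch, spikes_in)
ms_out = membrane_shortcut(u_branch, u_in)
print(sew_out, ms_out)
```

In this toy case the SEW-style output contains a 2, which a downstream layer would have to treat as a multi-bit value, while the membrane-potential shortcut keeps every output in {0, 1}, so downstream matrix products remain sparse additions.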


This review was created by AI and reviewed by human editors.