QUICK REVIEW

[Paper Review] QKFormer: Hierarchical Spiking Transformer using Q-K Attention

Chenlin Zhou, Han Zhang|arXiv (Cornell University)|Mar 25, 2024

Advanced Memory and Neural Computing | Citations: 6

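One-Line Summary

QKFormer introduces a spike-form Q-K attention with linear complexity and a patch embedding module with a deformed shortcut, yielding a hierarchical spiking transformer. It is directly trainable and outperforms prior SNN models on datasets such as ImageNet-1k.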

ABSTRACT

Spiking Transformers, which integrate Spiking Neural Networks (SNNs) with Transformer architectures, have attracted significant attention due to their potential for energy efficiency and high performance. However, existing models in this domain still suffer from suboptimal performance. We introduce several innovations to improve the performance: i) We propose a novel spike-form Q-K attention mechanism, tailored for SNNs, which efficiently models the importance of token or channel dimensions through binary vectors with linear complexity. ii) We incorporate the hierarchical structure, which significantly benefits the performance of both the brain and artificial neural networks, into spiking transformers to obtain multi-scale spiking representation. iii) We design a versatile and powerful patch embedding module with a deformed shortcut specifically for spiking transformers. Together, we develop QKFormer, a hierarchical spiking transformer based on Q-K attention with direct training. QKFormer shows significantly superior performance over existing state-of-the-art SNN models on various mainstream datasets. Notably, with comparable size to Spikformer (66.34 M, 74.81%), QKFormer (64.96 M) achieves a groundbreaking top-1 accuracy of 85.65% on ImageNet-1k, substantially outperforming Spikformer by 10.84%. To our best knowledge, this is the first time that directly training SNNs have exceeded 85% accuracy on ImageNet-1K. The code and models are publicly available at https://github.com/zhouchenlin2096/QKFormer

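Motivation and Objectives

Motivate energy-efficient, high-performance spiking neural networks for vision tasks.
Develop a hierarchical spiking transformer that is trained directly rather than obtained by conversion.
Propose a spike-form Q-K attention mechanism with linear complexity.
Design a patch embedding module with a deformed shortcut to improve information flow.
Demonstrate state-of-the-art results on ImageNet-1k and other datasets.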

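Proposed Method

Propose Q-K attention, which uses spike-form Query (Q) and Key (K) to compute token and channel importance with linear complexity.
Decompose Q-K attention into Q-K Token Attention (QKTA) and Q-K Channel Attention (QKCA), both implemented with spike-based masking and simple aggregation.
Incorporate a deformed shortcut into the patch embedding module (PEDS), enabling residual learning in patch embedding.
Build a three-stage hierarchical spiking transformer (QKFormer) that progressively reduces the number of tokens while increasing the channel dimension.
Train directly in the spiking domain, using a Spiking MLP (SMLP) and a final FC layer for classification.
Run extensive experiments across ImageNet-1k, CIFAR-10/100, CIFAR10-DVS, and DVS128 Gesture to validate performance and efficiency.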
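
The masking-and-aggregation idea behind QKTA and QKCA can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the Heaviside threshold standing in for a spiking neuron, the threshold value, and the toy Q/K matrices are all assumptions, and the learned projections, normalization, and time dimension are omitted.

```python
def heaviside_spike(values, threshold):
    """Assumed stand-in for a spiking neuron: emit 1 when the input reaches the threshold."""
    return [1 if v >= threshold else 0 for v in values]


def qk_token_attention(q, k, threshold):
    """Token variant: sum Q over channels, binarize, then mask the rows (tokens) of K.

    q and k are N x D lists of binary spikes (N tokens, D channels).
    """
    token_sums = [sum(row) for row in q]              # one pass over Q: O(N * D)
    attn = heaviside_spike(token_sums, threshold)     # binary vector of length N
    masked = [[a * x for x in row] for a, row in zip(attn, k)]
    return attn, masked


def qk_channel_attention(q, k, threshold):
    """Channel variant: sum Q over tokens, binarize, then mask the columns (channels) of K."""
    channel_sums = [sum(col) for col in zip(*q)]      # one pass over Q: O(N * D)
    attn = heaviside_spike(channel_sums, threshold)   # binary vector of length D
    masked = [[a * x for a, x in zip(attn, row)] for row in k]
    return attn, masked


# Toy spike matrices: 4 tokens, 3 channels (hypothetical values).
q = [[1, 0, 1], [0, 0, 0], [0, 0, 1], [1, 1, 1]]
k = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [0, 0, 1]]

token_attn, token_out = qk_token_attention(q, k, threshold=2)
channel_attn, channel_out = qk_channel_attention(q, k, threshold=2)
print(token_attn, token_out)
print(channel_attn, channel_out)
```

Neither function forms an N x N attention map: the attention is a single binary vector, so the work grows linearly with the number of tokens for a fixed channel count.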

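Experimental Results

Research Questions

RQ1: Can spike-form Q-K attention effectively model token and channel importance in spiking neural networks while keeping computational complexity linear?
RQ2: Does a hierarchical spiking architecture built on Q-K attention outperform existing SNN transformers on large-scale vision tasks?
RQ3: Does PEDS (the patch embedding module with a deformed shortcut) improve information transmission and accuracy in spiking transformers?
RQ4: Can a directly trained SNN reach 85% top-1 accuracy on ImageNet-1k with a reasonable number of time steps?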

Key Findings

Dataset	Methods	Architecture	Param (M)	Time Step	Top-1 Acc (%)
ImageNet-1K	HST-10-384 (QKFormer)	QKFormer	16.47	4	78.80
ImageNet-1K	HST-10-512 (QKFormer)	QKFormer	29.08	4	82.04
ImageNet-1K	HST-10-768 (QKFormer)	QKFormer	64.96	1	81.69
ImageNet-1K	HST-10-768 (QKFormer)	QKFormer	64.96	4	84.22
ImageNet-1K	HST-10-768 with ∗	QKFormer	64.96	4	85.25
ImageNet-1K	HST-10-768 with ∗∗	QKFormer	64.96	4	85.65
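
The accuracy gaps implied by this table can be checked with a few lines of arithmetic. The sketch below uses only figures from the table and the Spikformer number quoted in the abstract (74.81%); the variable names are illustrative, and `Decimal` is used so the subtraction is exact.

```python
from decimal import Decimal

# Top-1 accuracies (%) copied from the table and the abstract.
spikformer = Decimal("74.81")        # Spikformer, 66.34 M parameters (abstract)
qkformer_t1 = Decimal("81.69")       # HST-10-768, 1 time step
qkformer_t4 = Decimal("84.22")       # HST-10-768, 4 time steps
qkformer_best = Decimal("85.65")     # HST-10-768 with **, 4 time steps

gain_over_spikformer = qkformer_best - spikformer   # gap to Spikformer, in points
gain_from_time_steps = qkformer_t4 - qkformer_t1    # effect of 4 vs. 1 time steps

print(gain_over_spikformer, gain_from_time_steps)
```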

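QKFormer reaches 85.65% top-1 accuracy on ImageNet-1k with 4 time steps and 64.96 M parameters, the best result reported in the paper.
At a comparable model size, QKFormer outperforms Spikformer by 10.84 percentage points (85.65% vs. 74.81%).
According to the authors, this is the first time a directly trained SNN has exceeded 85% top-1 accuracy on ImageNet-1k.
Q-K attention has linear time and space complexity; QKTA and QKCA weight tokens and channels efficiently using spike-form operations.
PEDS improves accuracy over the Spikformer baseline on CIFAR100 and CIFAR10-DVS (CIFAR100: +2.05%, CIFAR10-DVS: +1.3%).
QKFormer performs well on both static and neuromorphic datasets, with more favorable memory and energy characteristics than quadratic attention mechanisms.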
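
The contrast between linear and quadratic attention can be illustrated with rough operation counts. The sketch below is a back-of-the-envelope estimate under stated assumptions, not a measurement from the paper: the cost formulas ignore projections and constant factors, and the token counts and channel width are hypothetical.

```python
def quadratic_attention_cost(n, d):
    """Rough cost of standard self-attention: Q @ K^T and attn @ V, each about n*n*d operations."""
    return {"ops": 2 * n * n * d, "attention_entries": n * n}


def qk_token_attention_cost(n, d):
    """Rough cost of a token-wise Q-K attention: sum Q over channels, then mask K (about n*d each)."""
    return {"ops": 2 * n * d, "attention_entries": n}


d = 64  # hypothetical channel width
for n in (196, 784, 3136):  # hypothetical token counts
    quad = quadratic_attention_cost(n, d)
    lin = qk_token_attention_cost(n, d)
    # Under these formulas the ratio equals n, so the gap widens as the token count grows.
    print(n, quad["ops"] // lin["ops"], quad["attention_entries"], lin["attention_entries"])
```

Under these assumptions the advantage is largest where there are many tokens, which is the situation the early stages of a hierarchical model face.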

This review was generated by AI and checked by a human editor.