QUICK REVIEW

[Paper Review] SSTFormer: Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event based Recognition

Xiao Wang, Rong Yao | arXiv (Cornell University) | Aug 8, 2023

Advanced Memory and Neural Computing | Cited by 9

One-sentence summary

SSTFormer fuses RGB frames and raw event streams using a hybrid of a Spiking CNN and a Memory Support Transformer, joined by a bottleneck fusion module, and introduces the PokerEvent dataset to advance RGB-Event recognition.

ABSTRACT

Event camera-based pattern recognition is a newly arising research topic in recent years. Current researchers usually transform the event streams into images, graphs, or voxels, and adopt deep neural networks for event-based classification. Although good performance can be achieved on simple event recognition datasets, their results may still be limited due to the following two issues. Firstly, they adopt spatial sparse event streams for recognition only, which may fail to capture the color and detailed texture information well. Secondly, they adopt either Spiking Neural Networks (SNN) for energy-efficient recognition with suboptimal results, or Artificial Neural Networks (ANN) for energy-intensive, high-performance recognition. However, few of them consider achieving a balance between these two aspects. In this paper, we formally propose to recognize patterns by fusing RGB frames and event streams simultaneously and propose a new RGB frame-event recognition framework to address the aforementioned issues. The proposed method contains four main modules, i.e., memory support Transformer network for RGB frame encoding, spiking neural network for raw event stream encoding, multi-modal bottleneck fusion module for RGB-Event feature aggregation, and prediction head. Due to the scarcity of RGB-Event based classification datasets, we also propose a large-scale PokerEvent dataset which contains 114 classes and 27102 frame-event pairs recorded using a DVS346 event camera. Extensive experiments on two RGB-Event based classification datasets fully validated the effectiveness of our proposed framework. We hope this work will boost the development of pattern recognition by fusing RGB frames and event streams. Both our dataset and source code of this work will be released at https://github.com/Event-AHU/SSTFormer

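Motivation and objectives

Fuse RGB frames with event streams to improve on the performance of single-modality, event-only recognition.
Combine spiking neural networks with Transformer-based temporal modeling to obtain recognition that is both energy-efficient and accurate.
Propose a large-scale RGB-Event dataset (PokerEvent) that enables robust evaluation of frame-event recognition models.
Introduce a multi-modal bottleneck fusion mechanism that effectively integrates RGB and event features for classification.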

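Proposed method

To balance energy consumption and accuracy, raw event streams are encoded directly by a Spiking Neural Network (SNN) encoder, which is combined with an ANN decoder.
A Memory Support Transformer (MST) captures spatio-temporal information from RGB frames through clip-based support-query cross-attention.
RGB and event features are fused by a Multi-modal Bottleneck Fusion (MBF) module that uses deformable convolutions, enabling interactive learning between the two modalities.
An optional dual-Transformer variant combines SpikingFormer with MST to further strengthen RGB-Event recognition.
Training uses a cross-entropy loss and a 16-step SNN simulation, matched to video-length inputs.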
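
To make the idea of a multi-step SNN simulation concrete, the following is a minimal sketch of a single leaky integrate-and-fire (LIF) neuron run for 16 time steps. It is an illustration only: this review does not state which neuron model, threshold, or decay the authors use, so the LIF dynamics, the hard reset, the function name `lif_spike_train`, and the values `threshold=1.0` and `decay=0.5` are all assumptions.

```python
def lif_spike_train(inputs, threshold=1.0, decay=0.5):
    """Simulate one leaky integrate-and-fire neuron over len(inputs) steps.

    threshold and decay are illustrative values, not taken from the paper.
    Returns a list of 0/1 spikes, one per time step.
    """
    membrane = 0.0
    spikes = []
    for current in inputs:
        # Leak the previous membrane potential, then integrate the new input.
        membrane = decay * membrane + current
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0  # hard reset after firing (an assumed reset rule)
        else:
            spikes.append(0)
    return spikes


T = 16  # number of simulation steps mentioned in the review
constant_input = [0.6] * T  # hypothetical constant input current
spikes = lif_spike_train(constant_input)
firing_rate = sum(spikes) / T
print(spikes)
print(firing_rate)
```

With these assumed values the neuron needs three steps to cross the threshold, so it fires on every third step. A spiking encoder stacks many such neurons and passes binary spikes between layers, which is where the energy savings relative to dense ANN activations are expected to come from.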

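Experimental results

Research questions

RQ1: Can RGB frames and raw event streams be fused effectively so that frame-event recognition improves over either single modality?
RQ2: Does combining an SNN encoder for raw event streams with an MST for RGB frames achieve a suitable trade-off between accuracy and energy?
RQ3: How does the MBF fusion strategy affect multi-modal recognition performance?
RQ4: Does the proposed framework generalize to a large-scale RGB-Event dataset designed for practical frame-event recognition tasks?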

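Key findings

The proposed SCNN-MST fusion (RGB-Event) improves recognition over single-modality baselines on PokerEvent, reaching 53.19% top-1 and 53.80% top-5 in the ablation.
The dual-Transformer variant (SpikingFormer-MST) reaches 54.74% top-1 on PokerEvent and 60.17% top-5 on HARDVS, indicating further gains from combining the spiking and Transformer paradigms.
MBF fusion consistently improves performance: in the ablation study, top-1 rises to 53.80% on PokerEvent (with MBF) and to 49.40% on HARDVS.
On HARDVS, the RGB-only MST reaches 48.17% top-1 and the event-only SCNN reaches 49.02% top-1, supporting the complementary strengths of the two modalities.
The PokerEvent results with fusion are competitive with several RGB and Transformer-based baselines, supporting the validity of RGB-Event fusion for practical recognition tasks.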
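
For readers less familiar with the metrics quoted above, the sketch below shows one common way to compute top-k accuracy from per-class scores. The function name `top_k_accuracy` and the toy scores are hypothetical and are not taken from the authors' evaluation code. By construction, top-k accuracy never decreases as k grows, so top-5 is always at least as high as top-1 on the same predictions.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example: 3 samples, 4 classes (made-up scores, not results from the paper).
scores = [
    [0.1, 0.7, 0.2, 0.0],
    [0.5, 0.2, 0.3, 0.0],
    [0.2, 0.3, 0.1, 0.4],
]
labels = [1, 2, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
top3 = top_k_accuracy(scores, labels, k=3)
print(top1, top2, top3)
```

In the toy data, the first sample is correct at rank 1, the second at rank 2, and the third at rank 3, so the accuracy climbs from one third to two thirds to one as k increases.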

This review was generated by AI and checked by a human editor.