[Paper Explained] SSTFormer: Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event based Recognition
SSTFormer fuses RGB frames and raw event streams through a hybrid Spiking CNN and a Memory Support Transformer joined by a bottleneck fusion module, and introduces the PokerEvent dataset to advance RGB-Event recognition.
Event camera-based pattern recognition is a newly emerging research topic. Current researchers usually transform event streams into images, graphs, or voxels and adopt deep neural networks for event-based classification. Although good performance can be achieved on simple event recognition datasets, the results may still be limited by two issues. First, these methods use only spatially sparse event streams for recognition, which may fail to capture color and detailed texture information. Second, they adopt either Spiking Neural Networks (SNN) for energy-efficient recognition with suboptimal results, or Artificial Neural Networks (ANN) for energy-intensive, high-performance recognition; few of them seek a balance between the two. In this paper, we formally propose to recognize patterns by fusing RGB frames and event streams simultaneously, and propose a new RGB frame-event recognition framework to address these issues. The proposed method contains four main modules: a memory support Transformer network for RGB frame encoding, a spiking neural network for raw event stream encoding, a multi-modal bottleneck fusion module for RGB-Event feature aggregation, and a prediction head. Because RGB-Event based classification datasets are scarce, we also propose a large-scale PokerEvent dataset, which contains 114 classes and 27102 frame-event pairs recorded using a DVS346 event camera. Extensive experiments on two RGB-Event based classification datasets fully validate the effectiveness of our proposed framework. We hope this work will boost the development of pattern recognition by fusing RGB frames and event streams. Both the dataset and the source code of this work will be released at https://github.com/Event-AHU/SSTFormer
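Research Motivation and Objectives
- Address the limited performance of single-modality event-based recognition by fusing RGB frames with event streams.
- Develop an energy-efficient yet accurate recognition method by combining spiking neural networks with Transformer-based temporal modeling.
- Propose a large-scale RGB-Event dataset (PokerEvent) to enable robust evaluation of frame-event recognition models.
- Introduce a multi-modal bottleneck fusion mechanism to effectively integrate RGB and event features for classification.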
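Proposed Method
- Encode raw event streams directly with a Spiking Neural Network (SNN) encoder, paired with an ANN decoder, to balance energy consumption and accuracy.
- Capture spatio-temporal information from RGB frames with a Memory Support Transformer (MST) using clip-based support-query cross-attention.
- Fuse RGB and event features through a multi-modal bottleneck fusion (MBF) module with deformable convolutions, enabling interactive learning.
- An optional dual-Transformer variant combines SpikingFormer with MST to improve RGB-Event recognition.
- Train with a cross-entropy loss and a 16-step SNN simulation, aligned with the video-length input.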
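To make two of the steps above more concrete, here is a minimal sketch; it is not the authors' implementation. It shows (1) a leaky integrate-and-fire (LIF) neuron turning an input sequence into binary spikes over 16 simulation steps, and (2) scaled dot-product cross-attention in which query-clip features attend over support-clip features. The function names, the threshold and decay values, the hard reset, and the omission of learned projection matrices are all assumptions made for illustration; this summary does not specify the neuron model or the attention layout used in SSTFormer.

```python
import math


def lif_encode(inputs, threshold=1.0, decay=0.5):
    """Run one leaky integrate-and-fire neuron over a sequence of inputs.

    Returns a list of 0/1 spikes, one per time step. The threshold and
    decay values are illustrative assumptions, not the paper's settings.
    """
    membrane = 0.0
    spikes = []
    for current in inputs:
        membrane = decay * membrane + current  # leak, then integrate
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0  # hard reset after firing (assumed)
        else:
            spikes.append(0)
    return spikes


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, supports):
    """Each query vector attends over all support vectors.

    Uses plain scaled dot-product attention with the supports acting as
    both keys and values; learned projections are omitted for brevity.
    """
    dim = len(supports[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * si for qi, si in zip(q, s)) / math.sqrt(dim)
            for s in supports
        ]
        weights = softmax(scores)
        # Weighted sum of support vectors, one output vector per query.
        outputs.append(
            [sum(w * s[d] for w, s in zip(weights, supports)) for d in range(dim)]
        )
    return outputs


# Hypothetical toy inputs: a constant drive for 16 steps, and 2-D features.
spike_train = lif_encode([0.6] * 16)
attended = cross_attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
print(spike_train)
print(attended)
```

In this toy setup the neuron fires only once enough input has accumulated, so the output is sparse and binary, which is the property that makes spiking encoders attractive for energy efficiency. The attention output for each query is a convex combination of the support vectors, weighted toward the supports most similar to that query.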
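Experimental Results
Research Questions
- RQ1: Can RGB frames and raw event streams be fused effectively to improve frame-event recognition beyond a single modality?
- RQ2: Does combining a Spiking Neural Network encoder for raw event streams with a Memory Support Transformer for RGB frames achieve a favorable accuracy-energy trade-off?
- RQ3: How does the MBF fusion strategy affect multi-modal recognition performance?
- RQ4: Does the proposed framework generalize to large-scale RGB-Event datasets designed for practical frame-event recognition tasks?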
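Key Findings
- The proposed SCNN-MST fusion (RGB-Event) outperforms single-modality baselines on PokerEvent, reaching top-1 53.19% and top-5 53.80% in the ablation study.
- The dual-Transformer variant (SpikingFormer-MST) reaches top-1 54.74% on PokerEvent and top-5 60.17% on HARDVS, showing further gains from combining spiking learning with the Transformer paradigm.
- MBF fusion consistently improves performance: in the ablation study, top-1 on PokerEvent rises to 53.80% (with MBF) and top-1 on HARDVS rises to 49.40%.
- On HARDVS, the RGB-only MST reaches top-1 48.17%, while the event-only SCNN reaches top-1 49.02%, confirming that the two modalities have complementary strengths.
- The fusion results on PokerEvent are competitive with several RGB-based and Transformer-based baselines, demonstrating the feasibility of RGB-Event fusion in practical recognition tasks.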
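The findings above are reported as top-1 and top-5 accuracy. As a reminder of what those metrics measure, here is a minimal sketch of top-k accuracy over per-class scores. The function name `topk_accuracy` and the toy scores are hypothetical and do not come from the paper or its code release.

```python
def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores.

    scores: list of per-class score lists, one list per sample.
    labels: list of integer class indices, one per sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical toy example: 3 samples, 3 classes.
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
toy_labels = [1, 1, 0]
print(topk_accuracy(toy_scores, toy_labels, k=1))
print(topk_accuracy(toy_scores, toy_labels, k=2))
```

Top-k accuracy can never decrease as k grows, because every top-1 hit is also a top-5 hit.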
This summary was generated by AI and reviewed by a human editor.