QUICK REVIEW

[Paper Review] Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

Zifu Wan, Pingping Zhang|arXiv (Cornell University)|Apr 5, 2024

Natural Language Processing Techniques | Cited by 8

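One-Sentence Summary

Sigma introduces a multi-modal semantic segmentation architecture built on a Siamese Visual State Space Model (Mamba). It achieves global receptive fields with linear complexity and efficiently fuses RGB with an X-modality (thermal or depth).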

ABSTRACT

Multi-modal semantic segmentation significantly enhances AI agents' perception and scene understanding, especially under adverse conditions like low-light or overexposed environments. Leveraging additional modalities (X-modality) like thermal and depth alongside traditional RGB provides complementary information, enabling more robust and reliable prediction. In this work, we introduce Sigma, a Siamese Mamba network for multi-modal semantic segmentation utilizing the advanced Mamba. Unlike conventional methods that rely on CNNs, with their limited local receptive fields, or Vision Transformers (ViTs), which offer global receptive fields at the cost of quadratic complexity, our model achieves global receptive fields with linear complexity. By employing a Siamese encoder and innovating a Mamba-based fusion mechanism, we effectively select essential information from different modalities. A decoder is then developed to enhance the channel-wise modeling ability of the model. Our proposed method is rigorously evaluated on both RGB-Thermal and RGB-Depth semantic segmentation tasks, demonstrating its superiority and marking the first successful application of State Space Models (SSMs) in multi-modal perception tasks. Code is available at https://github.com/zifuwan/Sigma.

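Motivation and Objectives

Use additional modalities (thermal or depth) to make semantic segmentation robust under challenging conditions.
Propose a Siamese Mamba-based architecture that fuses modalities with linear complexity.
Develop a fusion mechanism and a channel-aware decoder suited to multi-modal segmentation.
Demonstrate state-of-the-art accuracy and efficiency on RGB-Thermal and RGB-Depth benchmarks.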

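Proposed Method

A Siamese encoder with four Visual State Space (VSS) blocks and downsampling extracts multi-scale global features from the RGB and X-modality inputs.
A Cross Mamba Block (CroMB) handles cross-modal feature interaction, and a Concat Mamba Block (ConMB) with Concat SS fuses the concatenated features.
A channel-aware Visual State Space (CVSS) decoder strengthens cross-channel information and upsamples the features to produce the segmentation.
Within the VSS blocks, Selective Scan 2D (SS2D) models long-range spatial dependencies with linear complexity.
ConMB processes the concatenated multi-modal sequence directly, preserving information without heavy shuffling or partitioning of tokens; Mamba's input-dependent dynamics make this possible.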
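
The scanning ideas in the list above can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it uses a scalar state and scalar weights (`a`, `w_delta`, `w_b`, `w_c` are hypothetical parameters chosen for illustration), and it assumes the four scan paths commonly described for SS2D (row-major, column-major, and their reverses). The helper names `selective_scan`, `scan_orders`, `ss2d`, and `concat_scan` are invented here. The point is only that each pass visits every token once, so cost grows linearly with the number of tokens, including when two modality sequences are concatenated.

```python
import math


def selective_scan(xs, a=-1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Toy scalar selective state-space recurrence; one pass, O(len(xs)) time."""
    h = 0.0
    ys = []
    for x in xs:
        # Input-dependent step size; softplus keeps it positive.
        delta = math.log1p(math.exp(w_delta * x))
        a_bar = math.exp(delta * a)      # decay factor in (0, 1) because a < 0
        b_t = w_b * x                    # input-dependent input projection
        c_t = w_c * x                    # input-dependent output projection
        h = a_bar * h + delta * b_t * x  # state update
        ys.append(c_t * h)               # readout
    return ys


def scan_orders(height, width):
    """Four traversal orders over a grid (assumed SS2D-style scan paths)."""
    row_major = [(r, c) for r in range(height) for c in range(width)]
    col_major = [(r, c) for c in range(width) for r in range(height)]
    return [row_major, row_major[::-1], col_major, col_major[::-1]]


def ss2d(grid):
    """Scan the grid along each order and sum the results per position."""
    height, width = len(grid), len(grid[0])
    out = [[0.0] * width for _ in range(height)]
    for order in scan_orders(height, width):
        ys = selective_scan([grid[r][c] for r, c in order])
        for (r, c), y in zip(order, ys):
            out[r][c] += y
    return out


def concat_scan(rgb_tokens, x_tokens):
    """Scan the concatenated sequence once, then split it back per modality."""
    ys = selective_scan(rgb_tokens + x_tokens)
    n = len(rgb_tokens)
    return ys[:n], ys[n:]


grid = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
fused_grid = ss2d(grid)
rgb_out, x_out = concat_scan([0.1, 0.2], [0.3, 0.4])
```

Because the recurrence in this sketch is causal, `rgb_out` equals a scan over the RGB tokens alone, while `x_out` is computed from a state that already carries RGB information. A reverse pass over the same concatenated sequence would be needed for the RGB half to see the X-modality tokens.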

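Experimental Results

Research Questions

RQ1: Can the Siamese Mamba architecture effectively fuse RGB with thermal or depth data for semantic segmentation?
RQ2: Compared with Transformer-based fusion, does Mamba-based fusion reduce computational complexity while maintaining or improving accuracy?
RQ3: How do the CroMB and ConMB fusion modules affect multi-modal segmentation performance?
RQ4: How does the channel-aware decoder contribute to cross-channel information modeling and final segmentation quality?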

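Key Findings

Sigma outperforms existing models in both accuracy and efficiency on RGB-Thermal and RGB-Depth segmentation benchmarks.
Cross-modal fusion with CroMB and ConMB yields a significant gain; ablations show that removing either block lowers performance.
The proposed CVSS decoder captures channel-level information better and produces better segmentation than alternatives such as MLP-based or Swin-based decoders.
Compared with Transformer-based fusion methods, Sigma compares favorably in parameter count and FLOPs, consistent with its linear complexity.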
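
To give a rough sense of what linear versus quadratic scaling means here, the arithmetic below counts token interactions for a single feature map. It is a back-of-the-envelope sketch, not a FLOPs measurement from the paper: the 480x640 input size, the stride of 4, and the four scan directions are illustrative assumptions, and constant factors such as channel width and state size are ignored.

```python
def token_count(height, width, stride):
    """Number of tokens in a feature map downsampled by `stride`."""
    return (height // stride) * (width // stride)


def attention_pairs(n_tokens):
    """Full self-attention compares every token with every token."""
    return n_tokens * n_tokens


def scan_steps(n_tokens, n_directions=4):
    """A directional scan visits each token once per direction."""
    return n_directions * n_tokens


n = token_count(480, 640, stride=4)
print(n, attention_pairs(n), scan_steps(n))  # 19200 368640000 76800
```

Under these assumptions, doubling the image side length multiplies the attention count by 16 but the scan count by only 4.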

This review was generated by AI and reviewed by a human editor.