QUICK REVIEW

[Paper Review] Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

Zifu Wan, Pingping Zhang|arXiv (Cornell University)|2024. 04. 05.

Natural Language Processing Techniques | Citations: 8

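One-Line Summary

Sigma introduces a Siamese architecture built on a Visual State Space Model (Mamba) for multi-modal semantic segmentation, achieving global receptive fields with linear complexity and efficient fusion between RGB and an X-modality (thermal or depth).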

ABSTRACT

Multi-modal semantic segmentation significantly enhances AI agents' perception and scene understanding, especially under adverse conditions like low-light or overexposed environments. Leveraging additional modalities (X-modality) like thermal and depth alongside traditional RGB provides complementary information, enabling more robust and reliable prediction. In this work, we introduce Sigma, a Siamese Mamba network for multi-modal semantic segmentation utilizing the advanced Mamba. Unlike conventional methods that rely on CNNs, with their limited local receptive fields, or Vision Transformers (ViTs), which offer global receptive fields at the cost of quadratic complexity, our model achieves global receptive fields with linear complexity. By employing a Siamese encoder and innovating a Mamba-based fusion mechanism, we effectively select essential information from different modalities. A decoder is then developed to enhance the channel-wise modeling ability of the model. Our proposed method is rigorously evaluated on both RGB-Thermal and RGB-Depth semantic segmentation tasks, demonstrating its superiority and marking the first successful application of State Space Models (SSMs) in multi-modal perception tasks. Code is available at https://github.com/zifuwan/Sigma.

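Motivation and Objectives

Leverage additional modalities (thermal and depth) to make semantic segmentation robust under challenging conditions.
Propose a Siamese Mamba-based architecture that enables cross-modal fusion with linear complexity.
Develop a fusion mechanism and a channel-aware decoder tailored to multi-modal segmentation.
Demonstrate state-of-the-art accuracy and efficiency on RGB-Thermal and RGB-Depth benchmarks.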

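Proposed Method

Adopts a Siamese encoder with four Visual State Space (VSS) blocks and downsampling to extract multi-scale global features from the RGB and X-modality inputs.
Uses a Cross Mamba Block (CroMB) for cross-modal feature interaction, and a Concat Mamba Block (ConMB) with Concat SS to fuse the concatenated features.
Implements a Channel-Aware Visual State Space (CVSS) decoder that strengthens cross-channel information and upsamples the features for segmentation.
Uses Selective Scan 2D (SS2D) inside the VSS blocks to model long-range spatial dependencies with linear complexity.
ConMB processes the concatenated multi-modal sequence directly, preserving information without excessive patching, aided by Mamba's input-dependent dynamics.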
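
To make the ideas of a shared (Siamese) encoder, sequence concatenation, and an input-dependent scan more concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: `shared_encoder`, `selective_scan`, and `concat_scan` are hypothetical names, the state is a single scalar, and the B and C projections that Mamba normally learns are fixed to 1 for brevity.

```python
import math


def shared_encoder(pixels, weight=0.5, bias=0.1):
    # Hypothetical stand-in for a Siamese branch: both modalities pass through
    # this one function, so they use identical weights.
    return [weight * p + bias for p in pixels]


def selective_scan(tokens, a=-1.0, w_delta=1.0):
    # Toy scalar recurrence: h_t = exp(delta_t * a) * h_{t-1} + delta_t * x_t.
    # delta_t depends on the input x_t, which is what makes the scan "selective".
    # Simplification (assumption): state size 1, with B and C fixed to 1.
    h = 0.0
    outputs = []
    for x in tokens:
        delta = math.log1p(math.exp(w_delta * x))  # softplus, always positive
        h = math.exp(delta * a) * h + delta * x    # one constant-cost update per token
        outputs.append(h)
    return outputs


def concat_scan(rgb_pixels, x_pixels):
    # Encode both modalities with the shared weights, join them along the
    # sequence axis, and run a single scan over the combined sequence.
    rgb_tokens = shared_encoder(rgb_pixels)
    x_tokens = shared_encoder(x_pixels)
    scanned = selective_scan(rgb_tokens + x_tokens)
    n = len(rgb_tokens)
    return scanned[:n], scanned[n:]  # RGB half, X-modality half


rgb_out, x_out = concat_scan([0.2, 0.8, 0.5], [0.1, 0.4, 0.9])
print(rgb_out)
print(x_out)
```

Because the hidden state carries over from the RGB tokens into the X-modality tokens, the second half of the output depends on both modalities, while the loop still costs one update per token. A single forward pass like this only lets later tokens see earlier ones, which is one reason 2-D variants such as SS2D scan the feature map along several directions.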

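Experimental Results

Research Questions

RQ1: Can the Siamese Mamba architecture effectively fuse RGB with thermal or depth data for semantic segmentation?
RQ2: Compared with Transformer-based fusion, does Mamba-based fusion reduce computational complexity while maintaining or improving accuracy?
RQ3: How do the CroMB and ConMB fusion modules affect multi-modal segmentation performance?
RQ4: How does the channel-aware decoder contribute to cross-channel information modeling and final segmentation quality?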

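Key Results

Sigma outperforms state-of-the-art models on RGB-Thermal and RGB-Depth segmentation benchmarks in both accuracy and efficiency.
Cross-modal fusion with CroMB and ConMB provides a notable benefit: in ablation experiments, removing either block degrades performance.
The proposed CVSS decoder captures channel-wise information more effectively and improves segmentation performance over alternatives such as MLP or Swin-based decoders.
Owing to its linear complexity, Sigma compares favorably with Transformer-based fusion methods in parameter count and FLOPs.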
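
The linear-versus-quadratic contrast can be illustrated with a rough operation count. The sketch below is a back-of-the-envelope comparison, not a measurement from the paper: the function names are hypothetical, channel dimensions and constant factors are ignored, and the state size of 16 is an illustrative assumption.

```python
def attention_pairwise_ops(seq_len):
    # Self-attention scores every token against every token: L * L pairs.
    return seq_len * seq_len


def scan_state_updates(seq_len, state_size=16):
    # A state space scan performs one update per token per state dimension.
    return seq_len * state_size


for side in (32, 64, 128):
    seq_len = side * side  # number of tokens in a side x side feature map
    print(side, attention_pairwise_ops(seq_len), scan_state_updates(seq_len))
```

Doubling the sequence length quadruples the attention count but only doubles the scan count. That gap widens further when two modalities are concatenated into one sequence of twice the length.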


This review was generated by AI and reviewed by a human editor.