QUICK REVIEW

[Paper Review] Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

Zifu Wan, Pingping Zhang|arXiv (Cornell University)|Apr 5, 2024

Natural Language Processing Techniques | Cited by 8

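In brief

Sigma introduces a Siamese architecture built on a Visual State Space Model (Mamba) for multi-modal semantic segmentation, achieving global receptive fields with linear complexity and efficiently fusing RGB with an X-modality (thermal or depth).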

ABSTRACT

Multi-modal semantic segmentation significantly enhances AI agents' perception and scene understanding, especially under adverse conditions like low-light or overexposed environments. Leveraging additional modalities (X-modality) like thermal and depth alongside traditional RGB provides complementary information, enabling more robust and reliable prediction. In this work, we introduce Sigma, a Siamese Mamba network for multi-modal semantic segmentation utilizing the advanced Mamba. Unlike conventional methods that rely on CNNs, with their limited local receptive fields, or Vision Transformers (ViTs), which offer global receptive fields at the cost of quadratic complexity, our model achieves global receptive fields with linear complexity. By employing a Siamese encoder and innovating a Mamba-based fusion mechanism, we effectively select essential information from different modalities. A decoder is then developed to enhance the channel-wise modeling ability of the model. Our proposed method is rigorously evaluated on both RGB-Thermal and RGB-Depth semantic segmentation tasks, demonstrating its superiority and marking the first successful application of State Space Models (SSMs) in multi-modal perception tasks. Code is available at https://github.com/zifuwan/Sigma.

Motivation and objectives

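Use additional modalities (thermal and depth) to make semantic segmentation robust under challenging conditions such as low light or overexposure.
Propose a Siamese Mamba-based architecture that enables cross-modal fusion with linear complexity.
Develop a fusion mechanism and a channel-aware decoder suited to multi-modal segmentation.
Demonstrate state-of-the-art accuracy and efficiency on RGB-Thermal and RGB-Depth benchmarks.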

Proposed method

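A Siamese encoder with four Visual State Space (VSS) Blocks and downsampling extracts multi-scale global features from the RGB and X-modality inputs.
A Cross Mamba Block (CroMB) handles cross-modal feature interaction, and a Concat Mamba Block (ConMB) with Concat SS fuses the concatenated features.
A Channel-Aware Visual State Space (CVSS) decoder strengthens cross-channel information and upsamples the features for segmentation.
Selective Scan 2D (SS2D) inside the VSS Blocks models long-range spatial dependencies with linear complexity.
Mamba's input-dependent dynamics let ConMB process the concatenated multi-modal sequence directly, avoiding excessive patchification and preserving information.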
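
The idea behind a 2D selective scan can be illustrated with a toy example. The sketch below is a minimal pure-Python illustration, not Sigma's or Mamba's implementation. `selective_scan` and `scan_2d` are hypothetical names, the sigmoid gating is an assumed stand-in for learned input-dependent parameters, and traversing the grid in four orders and averaging the results is an assumption about how such a scan is commonly arranged. It shows that each traversal visits every position exactly once (linear in the number of positions), while the merged output at any position can still depend on every input.

```python
import math


def selective_scan(xs):
    """Toy 1-D selective scan: a single pass, O(len(xs)) time.

    The decay `a` and write gate `b` are computed from the current input.
    These formulas are illustrative assumptions, not the parameterization
    used by Mamba or Sigma.
    """
    h = 0.0
    ys = []
    for x in xs:
        gate = 1.0 / (1.0 + math.exp(-x))  # sigmoid of the current input
        a = 1.0 - gate                     # share of the old state to keep
        b = gate                           # share of the new input to write
        h = a * h + b * x                  # recurrent state update
        ys.append(h)                       # identity readout, for simplicity
    return ys


def scan_2d(grid):
    """Scan a 2-D grid along four traversal orders and average the outputs."""
    height, width = len(grid), len(grid[0])
    row_major = [(r, c) for r in range(height) for c in range(width)]
    col_major = [(r, c) for c in range(width) for r in range(height)]
    # Assumed set of orders: each flattening, forward and reversed.
    orders = [row_major, row_major[::-1], col_major, col_major[::-1]]

    out = [[0.0] * width for _ in range(height)]
    for order in orders:
        ys = selective_scan([grid[r][c] for r, c in order])
        for (r, c), y in zip(order, ys):
            out[r][c] += y / len(orders)
    return out
```

For a grid that is zero everywhere except one corner, `scan_2d` returns a nonzero value at the opposite corner: the forward traversals carry the corner's contribution through the recurrent state, which is the global receptive field in miniature.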

Experimental results

Research questions

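RQ1: Can the Siamese Mamba architecture effectively fuse RGB with thermal or depth data for semantic segmentation?
RQ2: Compared with Transformer-based fusion, does the Mamba-based fusion approach reduce computational cost while maintaining or improving accuracy?
RQ3: How do the CroMB and ConMB fusion modules affect multi-modal segmentation performance?
RQ4: How does the channel-aware decoder contribute to modeling cross-channel information and to final segmentation quality?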

Key findings

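Sigma outperforms recent models in both accuracy and efficiency on RGB-Thermal and RGB-Depth segmentation benchmarks.
Cross-modal fusion with CroMB and ConMB yields notable gains; in ablations, removing either block lowers performance.
The proposed CVSS decoder captures channel-wise information better and improves segmentation results over alternatives such as MLP and Swin-based decoders.
Compared with Transformer-based fusion methods, Sigma shows a favorable parameter-count and FLOPs profile, consistent with its linear complexity.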
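
The gap between linear and quadratic scaling can be made concrete with simple counting. The sketch below is a rough illustration under stated assumptions. The function names, the 480x640 input, the patch size of 4, and the four scan directions are all hypothetical choices. The numbers are counts of token pairs and recurrence steps, not FLOPs and not results reported in the paper.

```python
def token_count(height, width, patch):
    """Number of tokens when an image is split into patch x patch tiles."""
    return (height // patch) * (width // patch)


def attention_pairs(n_tokens):
    """Token pairs scored by full self-attention: quadratic in n_tokens."""
    return n_tokens * n_tokens


def scan_steps(n_tokens, directions=4):
    """Recurrence steps for one scan per direction: linear in n_tokens."""
    return n_tokens * directions


# Hypothetical input size and patch size, chosen only for illustration.
n = token_count(480, 640, 4)
print(n, attention_pairs(n), scan_steps(n))  # 19200 368640000 76800
```

Doubling the token count quadruples `attention_pairs` but only doubles `scan_steps`, which is why the difference widens at higher resolutions.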

This review was generated by AI and checked by a human editor.