QUICK REVIEW

[Paper Review] MFil-Mamba: Multi-Filter Scanning for Spatial Redundancy-Aware Visual State Space Models

Puskal Khadka, KC Santosh|arXiv (Cornell University)|Mar 20, 2026

Advanced Neural Network Applications|Citations: 0

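One-line summary

MFil-Mamba replaces fixed directional traversal with a multi-filter scanning backbone, reducing redundancy when visual state space models process 2D visual data while achieving strong results on ImageNet-1K, MS COCO, and ADE20K.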

ABSTRACT

State Space Models (SSMs), especially recent Mamba architecture, have achieved remarkable success in sequence modeling tasks. However, extending SSMs to computer vision remains challenging due to the non-sequential structure of visual data and its complex 2D spatial dependencies. Although several early studies have explored adapting selective SSMs for vision applications, most approaches primarily depend on employing various traversal strategies over the same input. This introduces redundancy and distorts the intricate spatial relationships within images. To address these challenges, we propose MFil-Mamba, a novel visual state space architecture built on a multi-filter scanning backbone. Unlike fixed multi-directional traversal methods, our design enables each scan to capture unique and contextually relevant spatial information while minimizing redundancy. Furthermore, we incorporate an adaptive weighting mechanism to effectively fuse outputs from multiple scans in addition to architectural enhancements. MFil-Mamba achieves superior performance over existing state-of-the-art models across various benchmarks that include image classification, object detection, instance segmentation, and semantic segmentation. For example, our tiny variant attains 83.2% top-1 accuracy on ImageNet-1K, 47.3% box AP and 42.7% mask AP on MS COCO, and 48.5% mIoU on the ADE20K dataset. Code and models are available at https://github.com/puskal-khadka/MFil-Mamba.

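Motivation and objectives

Eliminate the redundancy and spatial distortion that fixed-direction scanning introduces when state space models are applied to 2D visual data.
Introduce a multi-filter scanning backbone that extracts complementary spatial cues without relying on predefined traversal paths.
Incorporate adaptive fusion to combine the outputs of multiple scans effectively.
Demonstrate performance across image classification, object detection, instance segmentation, and semantic segmentation.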

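Proposed method

Replace fixed-direction scanning with a multi-filter strategy that applies several spatial filters to the input feature map.
Form four representations using horizontal and vertical Sobel-based filters together with learnable dynamic filters.
Concatenate the filtered representations and process them through a selective state space module (MFil-SSM).
Fuse the outputs of the different scans with learnable weights through an adaptive fusion mechanism.
Replace the conventional MLP with a ConvFFN to strengthen local feature processing.
Provide three model variants (Tiny, Small, and Base) with detailed architectural configurations.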
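
The filter, scan, and fuse steps listed above can be illustrated with a small dependency-free sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration. The four representations are taken to be two Sobel responses plus two "learned" 3x3 kernels (the exact composition is not stated here). A plain linear recurrence stands in for the selective state space module. Scanning each representation in a loop stands in for processing the concatenated representations in one pass. The names filter2d, linear_scan, and mfil_sketch are hypothetical.

```python
import math

# 3x3 Sobel kernels: SOBEL_X responds to left-right intensity change,
# SOBEL_Y to top-bottom intensity change.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def filter2d(image, kernel):
    """Cross-correlate a 2D grid with a 3x3 kernel, zero-padding the border."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in range(3):
                for dj in range(3):
                    y, x = i + di - 1, j + dj - 1
                    if 0 <= y < h and 0 <= x < w:
                        out[i][j] += kernel[di][dj] * image[y][x]
    return out


def linear_scan(seq, decay=0.5):
    """ASSUMPTION: placeholder for the selective SSM, y_t = decay * y_{t-1} + x_t."""
    out, state = [], 0.0
    for x in seq:
        state = decay * state + x
        out.append(state)
    return out


def softmax(logits):
    """Turn unconstrained fusion logits into weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mfil_sketch(image, learned_kernels, fusion_logits, decay=0.5):
    """Filter, flatten, and scan each representation, then fuse the scans."""
    # ASSUMPTION: two Sobel filters plus two learned kernels give four views.
    kernels = [SOBEL_X, SOBEL_Y] + list(learned_kernels)
    scans = []
    for kernel in kernels:
        filtered = filter2d(image, kernel)
        sequence = [v for row in filtered for v in row]  # row-major flatten
        scans.append(linear_scan(sequence, decay))
    weights = softmax(fusion_logits)
    # Weighted sum across scans at every sequence position.
    return [sum(w * s[t] for w, s in zip(weights, scans))
            for t in range(len(scans[0]))]


# Usage: fixed kernels stand in for the learned dynamic filters.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
blur = [[1 / 9] * 3 for _ in range(3)]
fused = mfil_sketch(image, [identity, blur], fusion_logits=[0.0, 0.0, 0.0, 0.0])
```

In this sketch every representation is flattened in the same row-major order, so the views differ in the filter applied and not in the traversal path. In a trained model the fusion logits and the dynamic kernels would be learned parameters; here they are fixed values.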

Figure 1: Top-1 validation accuracy versus model parameters on the ImageNet-1K [11] dataset. MFil-Mamba demonstrates superior performance compared to baseline state-of-the-art models with similar parameter counts.

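Experimental results

Research questions

RQ1: Can multi-filter scanning capture the diverse spatial dependencies of 2D images without imposing an explicit scan order?
RQ2: Does adaptive fusion of the multi-filter outputs improve representation quality and downstream task performance?
RQ3: Do the MFil-Mamba variants achieve competitive or state-of-the-art results on ImageNet classification, MS COCO detection and segmentation, and ADE20K segmentation?
RQ4: How do the Tiny, Small, and Base architectural choices affect the balance of accuracy, parameters, and FLOPs on vision benchmarks?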

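Key findings

MFil-Mamba-T achieves 83.2% top-1 accuracy on ImageNet-1K, outperforming several baselines of comparable size and complexity.
MFil-Mamba-S achieves 83.9% top-1 accuracy on ImageNet-1K.
MFil-Mamba-B achieves 84.2% top-1 accuracy on ImageNet-1K.
On MS COCO with the 1x schedule, MFil-Mamba-T reaches 47.3 box AP / 42.7 mask AP, MFil-Mamba-S 47.9 / 46.4, and MFil-Mamba-B 49.0 / 47.6.
On COCO and the segmentation tasks, the MFil-Mamba variants show competitive or superior performance on the reported benchmarks.
Grad-CAM and receptive-field analyses provide interpretable insights that support the effectiveness of the spatial feature integration.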
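
As a small worked example of how the reported ImageNet-1K numbers scale across variants, the snippet below computes the top-1 gain between consecutive model sizes. It uses only the three accuracies listed in the findings. The variable names are illustrative, and parameter counts are left out because they are not given in this review.

```python
# Top-1 accuracy (%) on ImageNet-1K, as reported in the findings.
top1 = {"MFil-Mamba-T": 83.2, "MFil-Mamba-S": 83.9, "MFil-Mamba-B": 84.2}

names = list(top1)
# Gain in percentage points when moving to the next larger variant.
gains = {
    f"{small} -> {large}": round(top1[large] - top1[small], 1)
    for small, large in zip(names, names[1:])
}

for step, gain in gains.items():
    print(f"{step}: +{gain} points")
```

The step from Tiny to Small adds 0.7 points and the step from Small to Base adds 0.3 points, so the accuracy gain shrinks as the model grows.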

Figure 2: (Top) Overview of the MFil-Mamba. (Bottom Left) Illustration of Single MFil-Mamba Block. (Bottom Middle) Illustration of MFil-SSM block with filter-based scanning across four input representations. Each representation is independently filtered and then its patches are concatenated and pass


This review was generated by AI and checked by a human editor.