QUICK REVIEW

[Paper Review] MFil-Mamba: Multi-Filter Scanning for Spatial Redundancy-Aware Visual State Space Models

Puskal Khadka, KC Santosh | arXiv (Cornell University) | 2026-03-20

Advanced Neural Network Applications | Citations: 0

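One-Line Summary

MFil-Mamba builds a visual state space model on a multi-filter scanning backbone instead of directional traversal, reducing redundancy when processing 2D visual data. Its tiny variant reaches 83.2% top-1 accuracy on ImageNet-1K, 47.3% box AP on MS COCO, and 48.5% mIoU on ADE20K.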

ABSTRACT

State Space Models (SSMs), especially recent Mamba architecture, have achieved remarkable success in sequence modeling tasks. However, extending SSMs to computer vision remains challenging due to the non-sequential structure of visual data and its complex 2D spatial dependencies. Although several early studies have explored adapting selective SSMs for vision applications, most approaches primarily depend on employing various traversal strategies over the same input. This introduces redundancy and distorts the intricate spatial relationships within images. To address these challenges, we propose MFil-Mamba, a novel visual state space architecture built on a multi-filter scanning backbone. Unlike fixed multi-directional traversal methods, our design enables each scan to capture unique and contextually relevant spatial information while minimizing redundancy. Furthermore, we incorporate an adaptive weighting mechanism to effectively fuse outputs from multiple scans in addition to architectural enhancements. MFil-Mamba achieves superior performance over existing state-of-the-art models across various benchmarks that include image classification, object detection, instance segmentation, and semantic segmentation. For example, our tiny variant attains 83.2% top-1 accuracy on ImageNet-1K, 47.3% box AP and 42.7% mask AP on MS COCO, and 48.5% mIoU on the ADE20K dataset. Code and models are available at https://github.com/puskal-khadka/MFil-Mamba.

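Motivation and Objectives

Address the redundancy and spatial distortion that fixed directional scans introduce when state space models are applied to 2D visual data.
Introduce a multi-filter scanning backbone that extracts complementary spatial signals without a predefined traversal path.
Introduce adaptive fusion to combine the outputs of multiple scans effectively.
Demonstrate versatile performance across image classification, object detection, instance segmentation, and semantic segmentation.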
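
The redundancy named in the first objective can be illustrated with a small sketch. It assumes the four traversal orders typical of multi-directional visual SSMs (row-major, column-major, and their reverses). The function name `directional_scans` is illustrative and does not come from the paper. Each scan is only a reordering of the same tokens, so the four sequences carry no new content:

```python
def directional_scans(grid):
    """Return four fixed traversal orders over a 2D grid (illustrative helper)."""
    rows, cols = len(grid), len(grid[0])
    row_major = [v for row in grid for v in row]
    col_major = [grid[r][c] for c in range(cols) for r in range(rows)]
    return {
        "row_forward": row_major,
        "row_backward": row_major[::-1],
        "col_forward": col_major,
        "col_backward": col_major[::-1],
    }


grid = [[1, 2, 3],
        [4, 5, 6]]
scans = directional_scans(grid)
for name, seq in scans.items():
    print(name, seq)
# row_forward  [1, 2, 3, 4, 5, 6]
# row_backward [6, 5, 4, 3, 2, 1]
# col_forward  [1, 4, 2, 5, 3, 6]
# col_backward [6, 3, 5, 2, 4, 1]

# Every scan is a permutation of the same six tokens.
print(all(sorted(seq) == sorted(scans["row_forward"]) for seq in scans.values()))  # True
```

The column-major orders also place vertically adjacent pixels next to each other while separating horizontal neighbors, which is one way a fixed traversal can distort 2D spatial relationships.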

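Proposed Method

Replace fixed directional traversal with a multi-filter strategy that applies several spatial filters to the input feature map.
Use horizontal and vertical Sobel-based filters together with learnable dynamic filters to form four representations.
Concatenate the filtered representations and process them with a selective state space module (MFil-SSM).
Fuse the outputs of the different scans with an adaptive fusion mechanism that has trainable weights.
Replace the conventional MLP with a ConvFFN to improve local feature processing.
Provide three model variants (Tiny, Small, Base) with detailed architecture configurations.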
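
The filter, scan, and fuse steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation:

- The four representations are taken to be the raw input, Sobel-x, Sobel-y, and one placeholder "learned" kernel.
- Each representation is scanned separately, rather than concatenated first.
- The scan is a fixed-decay linear recurrence standing in for the selective SSM.
- The fusion weights are softmax-normalized.

All function names are hypothetical.

```python
import math

# Standard 3x3 Sobel kernels plus an identity kernel.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # responds to left-right intensity changes
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # responds to top-bottom intensity changes
IDENTITY = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]     # passes the input through unchanged


def filter2d(image, kernel):
    """Apply a 3x3 kernel with zero padding (cross-correlation, as in conv layers)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(3):
                for dj in range(3):
                    y, x = i + di - 1, j + dj - 1
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di][dj] * image[y][x]
            out[i][j] = acc
    return out


def toy_scan(seq, decay=0.5):
    """h_t = decay * h_{t-1} + x_t: a non-selective stand-in for an SSM scan."""
    state, out = 0.0, []
    for x in seq:
        state = decay * state + x
        out.append(state)
    return out


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mfil_sketch(image, learned_kernel, fusion_logits):
    """Filter -> flatten -> scan -> weighted fusion for a single-channel image."""
    # ASSUMPTION: this particular set of four kernels is chosen for illustration.
    kernels = [IDENTITY, SOBEL_X, SOBEL_Y, learned_kernel]
    weights = softmax(fusion_logits)  # ASSUMPTION: softmax-normalized trainable weights
    h, w = len(image), len(image[0])
    fused = [0.0] * (h * w)
    for kernel, weight in zip(kernels, weights):
        filtered = filter2d(image, kernel)
        seq = [v for row in filtered for v in row]  # same row-major order in every branch
        scanned = toy_scan(seq)
        fused = [f + weight * s for f, s in zip(fused, scanned)]
    return [fused[r * w:(r + 1) * w] for r in range(h)]


if __name__ == "__main__":
    image = [[0, 0, 1, 1] for _ in range(4)]         # vertical edge between columns 1 and 2
    placeholder = [[1 / 9] * 3 for _ in range(3)]    # stands in for a learned kernel
    fused = mfil_sketch(image, placeholder, fusion_logits=[0.0, 0.0, 0.0, 0.0])
    for row in fused:
        print([round(v, 3) for v in row])
```

Every branch here reads pixels in the same row-major order. The branches differ only in which filter produced the values, so diversity comes from the filters rather than from the traversal order.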

Figure 1: Top-1 validation accuracy versus model parameters on the ImageNet-1K [11] dataset. MFil-Mamba demonstrates superior performance compared to baseline state-of-the-art models with similar parameter counts.

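Experimental Results

Research Questions

RQ1: Can multi-filter scanning capture diverse spatial dependencies in 2D images without imposing an explicit traversal order?
RQ2: Does adaptive fusion of the multi-filter outputs improve representation quality and downstream task performance?
RQ3: Does the MFil-Mamba family achieve competitive or state-of-the-art results on ImageNet classification, MS COCO detection/segmentation, and ADE20K segmentation?
RQ4: How do the architectural choices (Tiny/Small/Base) balance accuracy, parameters, and FLOPs on vision benchmarks?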

Key Results

MFil-Mamba-T achieves 83.2% top-1 accuracy on ImageNet-1K, outperforming several baselines of similar size and complexity.
MFil-Mamba-S achieves 83.9% top-1 accuracy on ImageNet-1K.
MFil-Mamba-B achieves 84.2% top-1 accuracy on ImageNet-1K.
On MS COCO with the 1x schedule, MFil-Mamba-T achieves 47.3 box AP and 42.7 mask AP; MFil-Mamba-S achieves 47.9 box AP and 46.4 mask AP; MFil-Mamba-B achieves 49.0 box AP and 47.6 mask AP.
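On COCO and the segmentation tasks, the MFil-Mamba family performs competitively with or better than contemporary backbones across the reported benchmarks; the tiny variant, for example, reaches 48.5% mIoU on ADE20K.
Grad-CAM and receptive-field analyses give interpretable insight into the architecture and support the claim that it fuses spatial features effectively.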

Figure 2: (Top) Overview of the MFil-Mamba. (Bottom Left) Illustration of Single MFil-Mamba Block. (Bottom Middle) Illustration of MFil-SSM block with filter-based scanning across four input representations. Each representation is independently filtered and then its patches are concatenated and pass

This review was generated by AI and reviewed by a human editor.