QUICK REVIEW

[Paper Review] Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model

Yuheng Shi, Minjing Dong|arXiv (Cornell University)|May 23, 2024

Image Retrieval and Classification Techniques|Cited by 22

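One-Sentence Summary

Introduces Multi-Scale VMamba (MSVMamba), which combines multi-scale 2D scanning (MS2D), hierarchical MS3 blocks, and a ConvFFN to improve long-range dependency learning under a limited parameter budget; among SSM-based backbones, it achieves state-of-the-art results on ImageNet, COCO, and ADE20K.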

ABSTRACT

Despite the significant achievements of Vision Transformers (ViTs) in various vision tasks, they are constrained by the quadratic complexity. Recently, State Space Models (SSMs) have garnered widespread attention due to their global receptive field and linear complexity with respect to the input length, demonstrating substantial potential across fields including natural language processing and computer vision. To improve the performance of SSMs in vision tasks, a multi-scan strategy is widely adopted, which leads to significant redundancy of SSMs. For a better trade-off between efficiency and performance, we analyze the underlying reasons behind the success of the multi-scan strategy, where long-range dependency plays an important role. Based on the analysis, we introduce Multi-Scale Vision Mamba (MSVMamba) to preserve the superiority of SSMs in vision tasks with limited parameters. It employs a multi-scale 2D scanning technique on both original and downsampled feature maps, which not only benefits long-range dependency learning but also reduces computational costs. Additionally, we integrate a Convolutional Feed-Forward Network (ConvFFN) to address the lack of channel mixing. Our experiments demonstrate that MSVMamba is highly competitive, with the MSVMamba-Tiny model achieving 82.8% top-1 accuracy on ImageNet, 46.9% box mAP, and 42.2% instance mAP with the Mask R-CNN framework, 1x training schedule on COCO, and 47.6% mIoU with single-scale testing on ADE20K. Code is available at https://github.com/YuHengsss/MSVMamba.

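Motivation and Objectives

Address long-range forgetting in parameter-constrained vision models built on SSMs.
Develop a hierarchical multi-scale scanning strategy that reduces redundancy while preserving fine-grained information.
Integrate a ConvFFN to strengthen channel mixing and local feature extraction in SSM-based backbones.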

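Proposed Method

Replace the SS2D module in VMamba with a Multi-Scale State Space (MS3) block, which comprises MS2D scanning and a ConvFFN channel mixer.
Build MS2D by creating multi-scale feature maps with depthwise convolutions (strides 1 and s), processing the full-resolution and downsampled maps with S6 blocks, and aggregating the results.
Add a Squeeze-Excitation (SE) block after MS2D and use a ConvFFN (a depthwise convolution plus two fully connected layers) to enhance channel-wise information exchange.
Control the embedding dimension and the number of blocks to keep FLOPs comparable, enabling fair comparison under LeViT-like budgets.
Provide Nano, Micro, and Tiny variants with 6.9M to 33.0M parameters and 0.9 to 4.6 GFLOPs for scalable deployment.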
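
The multi-scale scanning idea above can be illustrated with a toy, single-channel sketch in plain Python. It is not the authors' implementation, and several pieces are assumptions made for illustration: average pooling stands in for the strided depthwise convolution, a fixed-decay linear recurrence stands in for the S6 block, and the choice of one full-resolution route plus three downsampled routes is a hypothetical split. All function names (downsample, scan, scan_2d, upsample, ms2d) are invented here.

```python
def downsample(fmap, s):
    """Stride-s average pooling (assumed stand-in for a strided depthwise conv)."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(0, h - s + 1, s):
        row = []
        for j in range(0, w - s + 1, s):
            window = [fmap[i + di][j + dj] for di in range(s) for dj in range(s)]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


def scan(seq, decay=0.5):
    """Toy recurrence h_t = decay * h_{t-1} + x_t (assumed stand-in for S6)."""
    h, out = 0.0, []
    for x in seq:
        h = decay * h + x
        out.append(h)
    return out


def scan_2d(fmap, route):
    """Flatten the map along a route, run the 1D scan, restore the 2D layout."""
    h, w = len(fmap), len(fmap[0])
    if route in ("row", "row_rev"):
        coords = [(i, j) for i in range(h) for j in range(w)]
    else:  # "col" or "col_rev"
        coords = [(i, j) for j in range(w) for i in range(h)]
    if route.endswith("_rev"):
        coords.reverse()
    values = scan([fmap[i][j] for i, j in coords])
    out = [[0.0] * w for _ in range(h)]
    for (i, j), v in zip(coords, values):
        out[i][j] = v
    return out


def upsample(fmap, s):
    """Nearest-neighbour upsampling by an integer factor s."""
    return [[v for v in row for _ in range(s)] for row in fmap for _ in range(s)]


def ms2d(fmap, s=2):
    """Scan one route at full resolution and three on the downsampled map,
    then sum the results. Assumes height and width are divisible by s."""
    h, w = len(fmap), len(fmap[0])
    outs = [scan_2d(fmap, "row")]
    tokens = h * w
    small = downsample(fmap, s)
    for route in ("row_rev", "col", "col_rev"):
        outs.append(upsample(scan_2d(small, route), s))
        tokens += len(small) * len(small[0])
    merged = [[sum(o[i][j] for o in outs) for j in range(w)] for i in range(h)]
    return merged, tokens


fmap = [[1.0] * 4 for _ in range(4)]
merged, tokens = ms2d(fmap, s=2)
print(tokens)  # 28 tokens scanned, versus 64 for four full-resolution scans
```

Under these assumptions, the scanned sequence length drops from 4HW to HW + 3HW/s², which is the kind of cost reduction the summary attributes to scanning downsampled maps.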

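Experimental Results

Research Questions

RQ1: How can multi-scale 2D scanning be designed to reduce redundancy and improve long-range dependency learning in SSM-based vision backbones?
RQ2: Under a fixed compute budget, how does integrating the ConvFFN and SE block affect cross-channel information exchange and overall accuracy?
RQ3: Can the hierarchical multi-scale VMamba design outperform existing VMamba variants and other SOTA backbones on ImageNet, COCO, and ADE20K while remaining efficient?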

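Key Findings

MSVMamba-T reaches 82.8% top-1 accuracy on ImageNet-1K with 33M parameters and 4.6 GFLOPs.
MSVMamba-Nano improves top-1 accuracy by up to 5.5 percentage points over the VMamba-Nano baseline at roughly comparable FLOPs.
MSVMamba-T exceeds VMamba-T by 0.6 percentage points in top-1 accuracy at a significantly lower computational cost.
On COCO object detection, MSVMamba-T outperforms Swin-T by +4.2 box AP and +2.9 mask AP under the 1x schedule.
On ADE20K semantic segmentation, MSVMamba-T achieves 47.6 mIoU with single-scale testing (48.5 with multi-scale testing).
Combining the ConvFFN with MS2D and the SE block yields notable accuracy gains (for example, about +0.5% top-1 with SE, and about +2.0% from the ConvFFN in the ablation study).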
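
For readers unfamiliar with the SE block credited in the ablation above, the following is a minimal plain-Python sketch of generic Squeeze-Excitation channel gating. It assumes bias-free fully connected layers, and the weights are arbitrary illustrative values; the function name se_block and the weight layout are hypothetical and are not taken from the MSVMamba code.

```python
import math


def se_block(x, w1, w2):
    """Generic Squeeze-Excitation gating (sketch, biases omitted).

    x  : list of C channels, each an H x W list of lists
    w1 : reduction weights, shape (C_r, C)
    w2 : expansion weights, shape (C, C_r)
    """
    # Squeeze: global average pool each channel to a single scalar.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[v * g for v in row] for row in ch] for ch, g in zip(x, gates)]
    return scaled, gates


# Two 2x2 channels with hypothetical weights (C = 2, C_r = 1).
x = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
w1 = [[1.0, 1.0]]
w2 = [[0.0], [1.0]]
scaled, gates = se_block(x, w1, w2)
print(gates)
```

The gates lie in (0, 1) and rescale whole channels, which is why SE is commonly described as a lightweight form of channel-wise information exchange.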

This review was generated by AI and reviewed by a human editor.