QUICK REVIEW

[Paper Review] MambaOut: Do We Really Need Mamba for Vision?

Weihao Yu, Xinchao Wang|arXiv (Cornell University)|May 13, 2024

Cited by 34

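One-Sentence Summary

MambaOut removes the state space model (SSM) from the Mamba block and stacks the resulting Gated CNN blocks, showing that the SSM is not necessary for ImageNet image classification while suggesting that it may still help long-sequence vision tasks such as detection and segmentation.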

ABSTRACT

Mamba, an architecture with RNN-like token mixer of state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism and subsequently applied to vision tasks. Nevertheless, the performance of Mamba for vision is often underwhelming when compared with convolutional and attention-based models. In this paper, we delve into the essence of Mamba, and conceptually conclude that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics. For vision tasks, as image classification does not align with either characteristic, we hypothesize that Mamba is not necessary for this task; Detection and segmentation tasks are also not autoregressive, yet they adhere to the long-sequence characteristic, so we believe it is still worthwhile to explore Mamba's potential for these tasks. To empirically verify our hypotheses, we construct a series of models named MambaOut through stacking Mamba blocks while removing their core token mixer, SSM. Experimental results strongly support our hypotheses. Specifically, our MambaOut model surpasses all visual Mamba models on ImageNet image classification, indicating that Mamba is indeed unnecessary for this task. As for detection and segmentation, MambaOut cannot match the performance of state-of-the-art visual Mamba models, demonstrating the potential of Mamba for long-sequence visual tasks. The code is available at https://github.com/yuweihao/MambaOut

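Motivation and Objectives

Assess whether Mamba's state space model (SSM) is necessary for visual recognition tasks.
Evaluate how MambaOut, which has no SSM, performs against visual Mamba models on ImageNet classification.
Examine the potential benefit of the SSM for long-sequence vision tasks such as object detection and semantic segmentation.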

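Proposed Method

Build MambaOut by stacking Gated CNN blocks (no SSM) in a ResNet-like four-stage hierarchy.
Within each Gated CNN block, use a simple depthwise convolution as the token mixer in place of Mamba's SSM-based token mixer.
Train on ImageNet with DeiT-style data augmentation and the AdamW optimizer so that results are comparable with visual Mamba models.
Evaluate object detection and instance segmentation on COCO, with MambaOut as the backbone of Mask R-CNN.
Evaluate semantic segmentation on ADE20K, with MambaOut as the backbone of UperNet.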
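
The Gated CNN block described above can be sketched as a forward pass in NumPy. This is a minimal illustration, not the authors' implementation: the expansion ratio of 8/3, the 7x7 kernel, the `conv_ratio` parameter, the random initialization, and all function and class names are assumptions chosen for the example. It shows the core idea only: a gated MLP whose value branch is mixed spatially by a depthwise convolution, with no SSM anywhere.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the channel axis (no learned scale/shift in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def depthwise_conv2d(x, kernel):
    # x: [H, W, C], kernel: [k, k, C]. Zero "same" padding, one filter per channel.
    k = kernel.shape[0]
    pad = k // 2
    h, w, _ = x.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(k):
        for dx in range(k):
            out += xp[dy:dy + h, dx:dx + w, :] * kernel[dy, dx, :]
    return out

class GatedCNNBlock:
    """Hypothetical Gated CNN block: norm -> fc1 -> (gate, value) -> conv -> fc2 -> residual."""

    def __init__(self, dim, expansion=8 / 3, kernel_size=7, conv_ratio=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.hidden = int(expansion * dim)
        self.conv_ch = int(conv_ratio * dim)  # channels that pass through the conv
        self.w1 = rng.normal(0.0, 0.02, (dim, 2 * self.hidden))
        self.b1 = np.zeros(2 * self.hidden)
        self.kernel = rng.normal(0.0, 0.02, (kernel_size, kernel_size, self.conv_ch))
        self.w2 = rng.normal(0.0, 0.02, (self.hidden, dim))
        self.b2 = np.zeros(dim)

    def __call__(self, x):
        # x: [H, W, C] feature map for a single image.
        shortcut = x
        y = layer_norm(x) @ self.w1 + self.b1
        gate, value = y[..., :self.hidden], y[..., self.hidden:]
        split = self.hidden - self.conv_ch
        # Token mixing: depthwise conv on the last conv_ch channels of the value branch.
        mixed = depthwise_conv2d(value[..., split:], self.kernel)
        value = np.concatenate([value[..., :split], mixed], axis=-1)
        out = (gelu(gate) * value) @ self.w2 + self.b2
        return out + shortcut

if __name__ == "__main__":
    block = GatedCNNBlock(dim=8)
    feats = np.random.default_rng(1).normal(size=(14, 14, 8))
    print(block(feats).shape)
```

Under these assumptions, a Mamba block would differ only by inserting an SSM after the depthwise convolution, which is the component MambaOut leaves out.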

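Experimental Results

Research Questions

RQ1: Within a Mamba-like architecture, is the SSM necessary for ImageNet image classification?
RQ2: Can a simpler Gated CNN block without an SSM outperform visual Mamba models on ImageNet classification?
RQ3: Does removing the SSM degrade performance on long-sequence vision tasks such as object detection and semantic segmentation?
RQ4: Is there evidence that Mamba's advantages in vision are limited to long-sequence or autoregressive tasks?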

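Key Findings

Without any SSM, MambaOut consistently outperforms visual Mamba models on ImageNet across multiple model sizes.
At comparable MACs, MambaOut achieves higher top-1 accuracy than LocalVMamba-S and other visual Mamba variants.
On COCO and ADE20K, MambaOut falls short of state-of-the-art visual Mamba models and generally trails the best convolution-attention hybrid models, suggesting that the SSM may still help long-sequence vision tasks.
Overall, MambaOut supports the hypothesis that the SSM is unnecessary for image classification, while indicating that it may offer benefits for detection and segmentation.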
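
To make "long-sequence" concrete, the following sketch applies a rough rule of thumb: a token sequence of length L counts as long once the quadratic attention cost exceeds the linear cost, which for the common Transformer block FLOP estimate 24D²L + 4DL² means L > 6D for channel width D. The patch size of 16, the widths 384 and 768, the input resolutions, and all function names are illustrative assumptions, not values taken from this review.

```python
def num_tokens(height, width, patch_size=16):
    # Number of non-overlapping patches (tokens) for an image of the given size.
    return (height // patch_size) * (width // patch_size)

def long_sequence_threshold(channels):
    # 4*D*L^2 > 24*D^2*L  <=>  L > 6*D
    return 6 * channels

def is_long_sequence(tokens, channels):
    return tokens > long_sequence_threshold(channels)

if __name__ == "__main__":
    # Assumed typical input sizes: classification crop vs. detection-style input.
    cases = {"classification 224x224": (224, 224), "detection 800x1280": (800, 1280)}
    for name, (h, w) in cases.items():
        tokens = num_tokens(h, w)
        for channels in (384, 768):
            print(name, tokens, channels, is_long_sequence(tokens, channels))
```

Under these assumptions, a 224x224 classification input yields far fewer tokens than either threshold, while a detection-sized input exceeds the threshold for the narrower width. This is consistent with the finding that the SSM matters more for detection and segmentation than for classification.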

This review was generated by AI and checked by human editors.