QUICK REVIEW

[Paper Review] PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition

Chenhongyi Yang, Zehui Chen|arXiv (Cornell University)|Mar 26, 2024

Advanced Image and Video Retrieval Techniques | Cited by 25

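One-Sentence Summary

PlainMamba is a simple, non-hierarchical state space model for visual recognition. It uses continuous 2D scanning and direction-aware updating to process images efficiently, and it achieves results competitive with hierarchical models at lower complexity.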

ABSTRACT

We present PlainMamba: a simple non-hierarchical state space model (SSM) designed for general visual recognition. The recent Mamba model has shown how SSMs can be highly competitive with other architectures on sequential data and initial attempts have been made to apply it to images. In this paper, we further adapt the selective scanning process of Mamba to the visual domain, enhancing its ability to learn features from two-dimensional images by (i) a continuous 2D scanning process that improves spatial continuity by ensuring adjacency of tokens in the scanning sequence, and (ii) direction-aware updating which enables the model to discern the spatial relations of tokens by encoding directional information. Our architecture is designed to be easy to use and easy to scale, formed by stacking identical PlainMamba blocks, resulting in a model with constant width throughout all layers. The architecture is further simplified by removing the need for special tokens. We evaluate PlainMamba on a variety of visual recognition tasks, achieving performance gains over previous non-hierarchical models and is competitive with hierarchical alternatives. For tasks requiring high-resolution inputs, in particular, PlainMamba requires much less computing while maintaining high performance. Code and models are available at: https://github.com/ChenhongyiYang/PlainMamba .

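Motivation and Objectives

Build a simple, non-hierarchical, Mamba-inspired visual encoder for a broad range of vision tasks.
Adapt selective scanning to 2D image data so that spatial continuity is preserved.
Introduce continuous 2D scanning and direction-aware updating to encode spatial relations.
Provide scalable PlainMamba variants that keep a constant width and avoid a CLS token.
Demonstrate competitive performance on ImageNet classification, COCO detection, and ADE20K segmentation.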

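Proposed Method

Revisit state space models (SSMs) and the Mamba approach, whose state updates depend on the input.
Introduce a convolutional tokenizer that produces visual tokens from the image.
Stack identical PlainMamba blocks to keep a constant width and avoid a CLS token.
Develop continuous 2D scanning, which ensures that consecutive tokens in the scan sequence are adjacent in 2D space.
Add direction-aware updating, which injects 2D relative-position information into the selective scan.
Define three PlainMamba variants (L1, L2, L3) of increasing depth/width, and report their FLOPs and parameter counts.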
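
The continuous 2D scanning and direction-aware updating steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the single row-wise snake path, the direction labels, and the scalar recurrence with fixed coefficients `a` and `b` are all assumptions made for illustration (in Mamba these coefficients depend on the input, and the official code may combine several scan paths).

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (row, col) position of a token on the patch grid


def raster_scan_order(height: int, width: int) -> List[Coord]:
    """Plain row-major order; jumps from the end of one row to the start of the next."""
    return [(r, c) for r in range(height) for c in range(width)]


def continuous_scan_order(height: int, width: int) -> List[Coord]:
    """Hypothetical snake path: even rows left-to-right, odd rows right-to-left."""
    order: List[Coord] = []
    for r in range(height):
        cols = range(width) if r % 2 == 0 else range(width - 1, -1, -1)
        order.extend((r, c) for c in cols)
    return order


def is_spatially_continuous(order: List[Coord]) -> bool:
    """True if every pair of consecutive tokens is adjacent on the 2D grid."""
    return all(
        abs(r1 - r0) + abs(c1 - c0) == 1
        for (r0, c0), (r1, c1) in zip(order, order[1:])
    )


def step_directions(order: List[Coord]) -> List[str]:
    """Label each token with the move that reached it; the first token gets 'begin'."""
    names = {(0, 1): "right", (0, -1): "left", (1, 0): "down", (-1, 0): "up"}
    dirs = ["begin"]
    for (r0, c0), (r1, c1) in zip(order, order[1:]):
        dirs.append(names[(r1 - r0, c1 - c0)])
    return dirs


def direction_aware_scan(
    xs: List[float], dirs: List[str], a: float, b: float, theta: Dict[str, float]
) -> List[float]:
    """Toy scalar recurrence h = a*h + (b + theta[dir])*x.

    Assumption: a and b are fixed scalars here, and theta is a per-direction
    offset standing in for learned direction parameters.
    """
    h, out = 0.0, []
    for x, d in zip(xs, dirs):
        h = a * h + (b + theta[d]) * x
        out.append(h)
    return out


if __name__ == "__main__":
    order = continuous_scan_order(3, 3)
    print(order)
    print(step_directions(order))
    print(is_spatially_continuous(raster_scan_order(3, 3)))  # raster order breaks adjacency
```

The point of the snake path is that no step ever leaves the 4-neighborhood of the previous token, whereas raster order makes a long jump at every row boundary. The direction label tells the recurrence which way the scan moved to reach each token.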

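Experimental Results

Research Questions

RQ1: How does a non-hierarchical SSM-based encoder perform on standard vision tasks without a CLS token or a hierarchical multi-scale structure?
RQ2: Can continuous 2D scanning and direction-aware updating improve 2D spatial learning in SSM-based vision models?
RQ3: How do the PlainMamba variants compare with non-hierarchical SSMs, Transformers, and hierarchical models on classification, detection, and segmentation?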

Key Findings

Model	Hierarchical	Params	FLOPs	Top-1 (%)
PlainMamba-L1	No	7.3M	3.0G	77.9
PlainMamba-L2	No	25M	8.1G	81.6
PlainMamba-L3	No	50M	14.4G	82.3
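
As a rough reading aid for the table above, the sketch below computes how much Top-1 accuracy each step up in model size buys per additional GFLOP. The numbers are copied from the table; the helper name `marginal_gain_per_gflop` and the choice of metric are illustrative assumptions, not something reported in the paper.

```python
from typing import List, Tuple

# (name, params in millions, FLOPs in G, Top-1 in %), copied from the table above
VARIANTS: List[Tuple[str, float, float, float]] = [
    ("PlainMamba-L1", 7.3, 3.0, 77.9),
    ("PlainMamba-L2", 25.0, 8.1, 81.6),
    ("PlainMamba-L3", 50.0, 14.4, 82.3),
]


def marginal_gain_per_gflop(
    rows: List[Tuple[str, float, float, float]]
) -> List[Tuple[str, float]]:
    """Top-1 points gained per extra GFLOP between consecutive variants."""
    gains = []
    for (n0, _, f0, a0), (n1, _, f1, a1) in zip(rows, rows[1:]):
        gains.append((f"{n0} -> {n1}", (a1 - a0) / (f1 - f0)))
    return gains


if __name__ == "__main__":
    for step, gain in marginal_gain_per_gflop(VARIANTS):
        print(f"{step}: {gain:.2f} Top-1 points per extra GFLOP")
```

Under this metric the step from L1 to L2 yields roughly 0.73 points per extra GFLOP, while the step from L2 to L3 yields roughly 0.11, so accuracy grows more slowly as the model is scaled up.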

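PlainMamba-L2 and PlainMamba-L3 achieve ImageNet-1K Top-1 accuracy comparable to non-hierarchical SSMs and Transformers, and approach hierarchical models of similar size.
PlainMamba outperforms previous non-hierarchical SSMs (e.g., Vision Mamba and Mamba-ND) at similar parameter budgets.
On semantic segmentation (ADE20K) and object detection (COCO), PlainMamba matches or exceeds non-hierarchical baselines while using fewer parameters and fewer FLOPs in some configurations.
Ablation studies show that keeping width and depth balanced in deeper models generally improves accuracy, with diminishing returns beyond a certain depth and width.
Compared with approaches that rely on a CLS token or a hierarchy, PlainMamba offers a simpler, scalable backbone with competitive performance that is easier to integrate across modalities.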


This review was generated by AI and checked by a human editor.