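[Paper Review] LocalMamba: Visual State Space Model with Windowed Selective Scan
Summary: LocalMamba introduces window-based local scanning and a per-layer scan-direction search for Vision Mamba, achieving better results on ImageNet, COCO, and ADE20K while keeping computation efficient.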
Recent advancements in state space models, notably Mamba, have demonstrated significant progress in modeling long sequences for tasks like language understanding. Yet, their application in vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling. Traditional ViM approaches, which flatten spatial tokens, overlook the preservation of local 2D dependencies, thereby elongating the distance between adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, acknowledging the varying preferences for scan patterns across different network layers, we propose a dynamic method to independently search for the optimal scan choices for each layer, substantially improving performance. Extensive experiments across both plain and hierarchical models underscore our approach's superiority in effectively capturing image representations. For example, our model significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs. Code is available at: https://github.com/hunto/LocalMamba.
Research Motivation and Objectives
- Motivate the use of local 2D-aware scanning to preserve local dependencies in vision state space models.
- Propose a windowed local scan mechanism to better capture local structures within images while maintaining global context.
- Introduce a learnable per-layer scan-direction search to optimize scanning patterns across network depth.
- Develop plain and hierarchical model variants (LocalVim and LocalVMamba) to validate scalability and effectiveness.
- Demonstrate improvements over Vim, VMamba, CNNs, and ViTs across classification, detection, and segmentation tasks.
Proposed Method
- Introduce LocalMamba with a four-branch local scan block that processes input features in parallel local windows and aggregates via a Spatial-Channel Attention (SCAttn).
- Extend scan directions with eight candidates (horizontal, vertical, and 2x2/7x7 local windows, each in standard and flipped forms) and use a differentiable search (à la DARTS) to select four directions per layer.
- Use continuous relaxation to combine multiple SSMs per layer during training and select the top four directions for inference.
- Provide two architecture variants: LocalVim (plain) and LocalVMamba (hierarchical), replacing Vim/VMamba blocks with LocalMamba blocks.
- Report results across ImageNet classification, COCO object detection/segmentation, and ADE20K semantic segmentation to illustrate gains.
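The per-layer direction search described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the candidate labels in `CANDIDATES`, the helper names (`softmax`, `mixed_output`, `select_directions`), and the toy branch outputs are all assumptions made for the example. It shows only the two ideas named in the list: a softmax-weighted mix of candidate branches during the search, and keeping the four highest-scoring directions afterwards.

```python
import math

# Hypothetical labels for the eight candidates: horizontal, vertical,
# and 2x2 / 7x7 local windows, each in standard and flipped form.
CANDIDATES = ["h", "h_flip", "v", "v_flip", "w2", "w2_flip", "w7", "w7_flip"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_output(branch_outputs, logits):
    """Continuous relaxation: softmax-weighted sum of the candidate branches.

    branch_outputs holds one equally sized feature vector per candidate;
    logits holds the learnable score for each candidate in this layer.
    """
    weights = softmax(logits)
    length = len(branch_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, branch_outputs))
        for i in range(length)
    ]


def select_directions(logits, k=4):
    """Keep the k candidates with the highest scores (assumed selection rule)."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [CANDIDATES[i] for i in sorted(ranked[:k])]


# Toy scores for a single layer; in a real search these would be learned.
layer_logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, 3.0, -0.5]
print(select_directions(layer_logits))  # ['h_flip', 'v_flip', 'w2', 'w7']
```

In a real search the scores would be trained jointly with the network weights by gradient descent, and each layer would keep its own score vector, which is what allows different layers to end up with different scan combinations.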
![Figure 1: Illustration of scan methods. (a) and (b): Previous methods Vim [60] and VMamba [32] traverse the entire row or column axis, resulting in significant distances for capturing dependencies between neighboring pixels within the same semantic region (e.g., the left eye in the image). (c](https://ar5iv.labs.arxiv.org/html/2403.09338/assets/x1.png)
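The difference between the scan orders in Figure 1 can be made concrete with a small sketch. It assumes a token grid whose height and width are divisible by the window size, and it assumes that windows are visited in row-major order with tokens inside each window also visited in row-major order. The function names (`row_major_scan`, `local_window_scan`, `sequence_distance`) are hypothetical and do not come from the LocalMamba repository.

```python
def row_major_scan(height, width):
    """Flatten an H x W token grid row by row (plain horizontal scan)."""
    return [(r, c) for r in range(height) for c in range(width)]


def local_window_scan(height, width, window):
    """Visit the grid window by window, scanning each window row by row.

    Assumes height and width are divisible by `window`.
    """
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    order = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            for r in range(top, top + window):
                for c in range(left, left + window):
                    order.append((r, c))
    return order


def sequence_distance(order, a, b):
    """Number of scan steps separating tokens a and b in the flattened order."""
    return abs(order.index(a) - order.index(b))


# A 14 x 14 grid corresponds to a 224 x 224 image with 16 x 16 patches.
flat = row_major_scan(14, 14)
local = local_window_scan(14, 14, window=2)

# Two vertically adjacent tokens: far apart when flattened, close when windowed.
print(sequence_distance(flat, (0, 0), (1, 0)))   # 14
print(sequence_distance(local, (0, 0), (1, 0)))  # 2

# A flipped variant simply traverses the same order in reverse.
local_flipped = local[::-1]
```

Both orders visit every token exactly once; only the position of each token in the 1D sequence changes, which is why the local scan can shorten the distance between spatial neighbors without dropping any tokens.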
Experimental Results
Research Questions
- RQ1: Can a windowed local scanning strategy improve the preservation of local 2D dependencies in Vision Mamba models without sacrificing global context?
- RQ2: Does per-layer search for optimal scan directions yield measurable gains over fixed or single-direction scans?
- RQ3: How do plain and hierarchical LocalMamba variants perform relative to Vim, VMamba, CNNs, and ViTs across classification, detection, and segmentation tasks?
Key Findings
| Method | Image size | Params (M) | FLOPs (G) | Top-1 ACC (%) |
|---|---|---|---|---|
| LocalVim-T | 224^2 | 8 | 1.5 | 76.2 |
| LocalVim-S | 224^2 | 28 | 4.8 | 81.2 |
| VMamba-T | 224^2 | 22 | 5.6 | 82.2 |
| VMamba-S | 224^2 | 44 | 11.2 | 83.5 |
| LocalVMamba-T | 224^2 | 26 | 5.7 | 82.7 |
| LocalVMamba-S | 224^2 | 50 | 11.4 | 83.7 |
- On ImageNet-1K, LocalVim-T achieves 76.2% Top-1 accuracy with 1.5G FLOPs, surpassing DeiT-Ti (72.2%).
- Hierarchical LocalVMamba-T reaches 82.7% accuracy, outperforming Swin-T by 1.4%.
- At the small scale, LocalVim-S reaches 81.2% and LocalVMamba-S reaches 83.7%. Per the table above, LocalVMamba-S is 0.2 points above VMamba-S (83.5%), and LocalVMamba-T is 0.5 points above VMamba-T (82.2%).
- For COCO object detection and instance segmentation, LocalVMamba-T achieves 46.7 APb and 42.2 APm, outperforming Swin-T on both metrics.
- On ADE20K semantic segmentation, LocalVim-S achieves 46.4 mIoU (single-scale, SS), and LocalVMamba-S achieves 50.0 mIoU (SS) / 51.0 mIoU (multi-scale, MS).
- Ablations on ImageNet show that the local scan improves Vim-T by 1.0% and that SCAttn adds a further gain of about 0.6%.
![Figure 2: By extending the original scan with our local scan mechanism, our method significantly improves the ImageNet accuracies of Vim [60] while keeping similar FLOPs.](https://ar5iv.labs.arxiv.org/html/2403.09338/assets/x2.png)
This review was generated by AI and reviewed by human editors.