QUICK REVIEW

[Paper Digest] S$^2$-MLPv2: Improved Spatial-Shift MLP Architecture for Vision

Tan Yu, Li Xu|arXiv (Cornell University)|Aug 2, 2021

Advanced Image and Video Retrieval Techniques | References: 28 | Cited by: 30

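One-Sentence Summary

TL;DR: S2-MLPv2 improves the spatial-shift MLP by expanding the channels, splitting them into parts that undergo different spatial shifts, and fusing the parts with split-attention. It reaches 83.6% top-1 accuracy on ImageNet-1K with 55M parameters and no extra training data.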

ABSTRACT

Recently, MLP-based vision backbones emerge. MLP-based vision architectures with less inductive bias achieve competitive performance in image recognition compared with CNNs and vision Transformers. Among them, spatial-shift MLP (S$^2$-MLP), adopting the straightforward spatial-shift operation, achieves better performance than the pioneering works including MLP-mixer and ResMLP. More recently, using smaller patches with a pyramid structure, Vision Permutator (ViP) and Global Filter Network (GFNet) achieve better performance than S$^2$-MLP. In this paper, we improve the S$^2$-MLP vision backbone. We expand the feature map along the channel dimension and split the expanded feature map into several parts. We conduct different spatial-shift operations on split parts. Meanwhile, we exploit the split-attention operation to fuse these split parts. Moreover, like the counterparts, we adopt smaller-scale patches and use a pyramid structure for boosting the image recognition accuracy. We term the improved spatial-shift MLP vision backbone as S$^2$-MLPv2. Using 55M parameters, our medium-scale model, S$^2$-MLPv2-Medium achieves an $83.6\%$ top-1 accuracy on the ImageNet-1K benchmark using $224 \times 224$ images without self-attention and external training data.

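Motivation and Objectives

Advance MLP-based vision backbones, which rely on less inductive bias than CNNs and vision Transformers.
Introduce S2-MLPv2, which improves cross-patch communication through channel expansion and split-attention.
Use a pyramid structure with smaller patches to raise recognition accuracy.
Demonstrate state-of-the-art performance among medium-scale MLP models on ImageNet-1K without extra training data.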

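Proposed Method

Expand the channel dimension inside the S2-MLP block from c to 3c with an MLP.
Split the expanded feature map into three parts and apply two asymmetric spatial-shift operations to the first two parts.
Fuse the three parts with a split-attention mechanism to produce the output feature map.
Adopt a two-level pyramid structure with smaller patches to improve fine-grained modeling.
Build each S2-MLPv2 block from two components: the S2-MLPv2 component and a channel-mixing MLP (CM-MLP).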
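The expand, split, shift, and fuse steps listed above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the weight matrices `w_expand`, `w1`, `w2`, and `w_out`, the GELU nonlinearity inside the attention MLP, and the exact shift directions are all illustrative choices. Biases, normalization, residual connections, and the CM-MLP are omitted, and the third part is left unshifted because the method applies shifts only to the first two parts.

```python
import numpy as np

def spatial_shift1(x):
    """Shift four channel groups of an (h, w, c) map one step in four directions."""
    q = x.shape[2] // 4
    out = x.copy()  # boundary rows/columns keep their original values
    out[1:, :, :q] = x[:-1, :, :q]                    # group 0: forward along axis 0
    out[:-1, :, q:2 * q] = x[1:, :, q:2 * q]          # group 1: backward along axis 0
    out[:, 1:, 2 * q:3 * q] = x[:, :-1, 2 * q:3 * q]  # group 2: forward along axis 1
    out[:, :-1, 3 * q:] = x[:, 1:, 3 * q:]            # group 3: backward along axis 1
    return out

def spatial_shift2(x):
    """Same shifts with the two spatial axes swapped (assumed form of the asymmetry)."""
    q = x.shape[2] // 4
    out = x.copy()
    out[:, 1:, :q] = x[:, :-1, :q]
    out[:, :-1, q:2 * q] = x[:, 1:, q:2 * q]
    out[1:, :, 2 * q:3 * q] = x[:-1, :, 2 * q:3 * q]
    out[:-1, :, 3 * q:] = x[1:, :, 3 * q:]
    return out

def gelu(z):
    """Tanh approximation of GELU (the choice of nonlinearity is an assumption)."""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z ** 3)))

def split_attention(parts, w1, w2):
    """Fuse K maps of shape (h, w, c) using per-channel softmax weights over the K parts."""
    stacked = np.stack(parts)                        # (K, h, w, c)
    k, _, _, c = stacked.shape
    pooled = stacked.sum(axis=(0, 1, 2))             # (c,) global summary
    logits = (gelu(pooled @ w1) @ w2).reshape(k, c)  # one logit per part and channel
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True)    # softmax over the K parts
    return (stacked * weights[:, None, None, :]).sum(axis=0)  # (h, w, c)

def s2mlpv2_component(x, w_expand, w1, w2, w_out):
    """Expand c -> 3c, split into three parts, shift the first two, fuse, project."""
    expanded = x @ w_expand                          # (h, w, 3c)
    x1, x2, x3 = np.split(expanded, 3, axis=2)
    fused = split_attention([spatial_shift1(x1), spatial_shift2(x2), x3], w1, w2)
    return fused @ w_out                             # (h, w, c)

# Toy usage with random weights (sizes are arbitrary, not the paper's).
rng = np.random.default_rng(0)
h, w, c = 4, 4, 8
x = rng.standard_normal((h, w, c))
y = s2mlpv2_component(
    x,
    w_expand=0.1 * rng.standard_normal((c, 3 * c)),
    w1=0.1 * rng.standard_normal((c, c)),
    w2=0.1 * rng.standard_normal((c, 3 * c)),
    w_out=0.1 * rng.standard_normal((c, c)),
)
print(y.shape)
```

Because the attention weights are a softmax over the three parts, each output channel is a convex combination of the corresponding channels of the parts. With all-zero attention weights the fusion reduces to a plain average.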

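Experimental Results

Research Questions

RQ1: Does expanding the channels and applying split-attention to differently shifted branches improve cross-patch communication over the original S2-MLP?
RQ2: Does a pyramid structure with smaller patches improve S2-MLPv2 accuracy on ImageNet-1K without relying on external data?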

Key Findings

Model	Pyramid	Params (M)	FLOPs (B)	Train size	Test size	Top-1 accuracy (%)
S2-MLPv2-Small/7	✓	25	6.9	224	224	82.0
S2-MLPv2-Medium/7	✓	55	16.3	224	224	83.6

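S2-MLPv2-Medium/7 reaches 83.6% top-1 accuracy on ImageNet-1K (224x224) with 55M parameters and 16.3B FLOPs.
S2-MLPv2-Small/7 reaches 82.0% top-1 accuracy with 25M parameters and 6.9B FLOPs.
Split-attention fusion outperforms simple sum pooling (Small/7 top-1: 82.0% vs. 79.8%).
The two-level pyramid structure with smaller patches outperforms the non-pyramid Small/14 configuration (Small/7: 82.0% vs. Small/14: 80.9%).
Compared with CNNs and vision Transformers, S2-MLPv2-Medium/7 reaches comparable accuracy with fewer parameters and compares favorably with many Transformer models.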
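The comparisons above can be tallied with a few lines of arithmetic. The sketch below uses only the accuracies quoted in this digest. It assumes that the "/7" and "/14" suffixes denote the initial patch size in pixels, which the digest does not state explicitly. The helper names `token_grid` and `gain` are illustrative.

```python
IMAGE_SIZE = 224  # train and test resolution from the results table

def token_grid(patch_size, image_size=IMAGE_SIZE):
    """Patches per side and total patch count for square, non-overlapping patches."""
    side = image_size // patch_size
    return side, side * side

# Top-1 accuracies (%) as quoted in the findings above.
top1 = {
    "Small/7 split-attention": 82.0,
    "Small/7 sum pooling": 79.8,
    "Small/14 non-pyramid": 80.9,
    "Medium/7": 83.6,
}

def gain(better, baseline):
    """Accuracy difference in percentage points, rounded to one decimal."""
    return round(top1[better] - top1[baseline], 1)

# Assumption: the suffix is the initial patch size in pixels.
for patch in (7, 14):
    side, total = token_grid(patch)
    print(f"patch {patch}x{patch}: {side}x{side} grid, {total} patches")

print("split-attention vs. sum pooling:", gain("Small/7 split-attention", "Small/7 sum pooling"))
print("pyramid Small/7 vs. Small/14:", gain("Small/7 split-attention", "Small/14 non-pyramid"))
print("Medium/7 vs. Small/7:", gain("Medium/7", "Small/7 split-attention"))
```

Under that assumption, halving the patch size from 14 to 7 quadruples the number of initial patches. That finer initial grid is what the digest refers to as improved fine-grained modeling.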

This digest was generated by AI and reviewed by human editors.