QUICK REVIEW

[Paper Digest] SPFormer: Enhancing Vision Transformer with Superpixel Representation

Jieru Mei, Liang-Chieh Chen|arXiv (Cornell University)|Jan 5, 2024

CCD and CMOS Imaging Sensors | Cited by 6

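One-Sentence Summary

SPFormer combines a learnable superpixel representation with the Vision Transformer through superpixel cross attention, improving accuracy, efficiency, and interpretability; it delivers solid gains on ImageNet and more robust segmentation performance.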

ABSTRACT

In this work, we introduce SPFormer, a novel Vision Transformer enhanced by superpixel representation. Addressing the limitations of traditional Vision Transformers' fixed-size, non-adaptive patch partitioning, SPFormer employs superpixels that adapt to the image's content. This approach divides the image into irregular, semantically coherent regions, effectively capturing intricate details and applicable at both initial and intermediate feature levels. SPFormer, trainable end-to-end, exhibits superior performance across various benchmarks. Notably, it exhibits significant improvements on the challenging ImageNet benchmark, achieving a 1.4% increase over DeiT-T and 1.1% over DeiT-S respectively. A standout feature of SPFormer is its inherent explainability. The superpixel structure offers a window into the model's internal processes, providing valuable insights that enhance the model's interpretability. This level of clarity significantly improves SPFormer's robustness, particularly in challenging scenarios such as image rotations and occlusions, demonstrating its adaptability and resilience.

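Motivation and Objectives

Bridge pixel-level and patch-level representations by using adaptive superpixels that preserve local detail while enabling efficient global modeling.
Develop a trainable, end-to-end SPFormer architecture that integrates superpixel representation with ViT through cross attention.
Show that superpixel-based representation improves ImageNet accuracy and enhances interpretability as well as robustness to rotation and occlusion.
Evaluate SPFormer on image classification and semantic segmentation to demonstrate its versatility and efficiency.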

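Proposed Method

Introduce a superpixel representation that aggregates pixels into semantically coherent regions, with an association matrix A defined between each pixel and its neighboring superpixels.
Propose Superpixel Cross Attention (SCA) with two cross-attention directions, Pixel-to-Superpixel (P2S) and Superpixel-to-Pixel (S2P), which iteratively refines the superpixel features S and the association A over t iterations.
Add a Convolution Position Embedding (CPE) to inject positional information into both pixel and superpixel features.
Adopt a dual-branch SPFormer architecture in which a high-resolution dense pixel branch is paired with a low-resolution superpixel branch that keeps computation efficient.
Use multi-head SCA to produce multiple semantically rich superpixel representations, follow it with MHSA for global context modeling, and propagate context progressively across stages through 1x1 convolutions.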
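
The P2S/S2P loop described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; it is a simplified illustration under several assumptions: the names `neighbor_mask`, `superpixel_cross_attention`, `stride`, and `iters` are hypothetical; similarity is a plain scaled dot product with no learned projections; the association A is a softmax over the 3x3 neighboring superpixels of each pixel; the P2S step is an association-weighted mean; the S2P step is a residual add; CPE and multi-head splitting are omitted.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax; entries set to -inf become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def neighbor_mask(H, W, stride):
    """Boolean (H*W, h*w) mask, True where a superpixel lies in the 3x3
    grid neighborhood of the cell that contains the pixel."""
    h, w = H // stride, W // stride
    py, px = np.divmod(np.arange(H * W), W)    # pixel row / column
    cy, cx = py // stride, px // stride        # home grid cell of each pixel
    sy, sx = np.divmod(np.arange(h * w), w)    # superpixel row / column
    return (np.abs(cy[:, None] - sy[None, :]) <= 1) & \
           (np.abs(cx[:, None] - sx[None, :]) <= 1)

def superpixel_cross_attention(P, H, W, stride, iters=2):
    """P: (H*W, C) pixel features in row-major order.
    Returns superpixel features S, association A, and updated pixels."""
    N, C = P.shape
    h, w = H // stride, W // stride
    # Assumption: initialise S by average pooling each stride x stride cell.
    S = P.reshape(h, stride, w, stride, C).mean(axis=(1, 3)).reshape(h * w, C)
    mask = neighbor_mask(H, W, stride)
    for _ in range(iters):
        # Association: each pixel attends only to its neighboring superpixels.
        logits = np.where(mask, P @ S.T / np.sqrt(C), -np.inf)
        A = softmax(logits, axis=1)                  # rows sum to 1
        # P2S: each superpixel becomes the A-weighted mean of its pixels.
        S = (A.T @ P) / A.sum(axis=0)[:, None]
    # S2P: scatter superpixel features back to pixels (residual form).
    P_out = P + A @ S
    return S, A, P_out
```

With a 16x16 feature map and `stride=4`, the sketch yields 16 superpixels and a 256x16 association matrix in which a corner pixel has 4 nonzero entries and an interior pixel has 9, which reflects the local (rather than global) pixel-to-superpixel association.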

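Experimental Results

Research Questions

RQ1: Can an adaptive, learnable superpixel representation combined with cross attention outperform fixed-patch ViTs on ImageNet and segmentation tasks?
RQ2: Does the SCA module improve the alignment between superpixels and semantic boundaries, and thereby generalize better to unseen data?
RQ3: How does SPFormer compare with conventional ViTs in efficiency and in robustness to rotation and occlusion?
RQ4: How does the superpixel branch provide global context while the high-resolution pixel branch preserves detail?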

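Key Findings

SPFormer improves over DeiT baselines on ImageNet; for example, SPFormer-S/32 † reaches 77.9% Top-1 with 22M parameters and 1.3G FLOPs, outperforming DeiT-S/32 and DeiT-T.
On ImageNet, SPFormer-S/56 reaches 72.3% Top-1 with 22M parameters and 0.5G FLOPs, while SPFormer-S/32 and SPFormer-S/32 † reach 76.4% and 77.9% Top-1, respectively.
Among the SPFormer-B and SPFormer-S variants, Top-1 reaches as high as 82.7% (87M parameters, 19.2G FLOPs), surpassing DeiT-B (81.8%) and DeiT-S (79.9%).
The learned superpixel associations align with image boundaries even without training on segmentation datasets, enabling zero-shot transfer to COCO and to part/object segmentation tasks.
Ablation studies show that multiple SCA iterations, multi-head attention, and the placement of SCA layers are all important to the performance gains.
SPFormer improves segmentation mIoU on ADE20K and Pascal Context, especially with ImageNet-pretrained models (up to +4.2% and +2.8%, respectively), and also when trained from scratch (+3.0% and +3.1%).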

This digest was generated by AI and reviewed by a human editor.