QUICK REVIEW

[Paper Review] SegViT: Semantic Segmentation with Plain Vision Transformers

Bowen Zhang, Zhi Tian|arXiv (Cornell University)|Oct 12, 2022

Advanced Neural Network Applications | Cited by 75

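ONE-SENTENCE SUMMARY

SegViT introduces an Attention-to-Mask (ATM) decoder that performs semantic segmentation with a plain Vision Transformer, achieving state-of-the-art or competitive results while cutting computation through a Shrunk backbone design.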

ABSTRACT

We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation and propose the SegVit. Previous ViT-based segmentation networks usually learn a pixel-level representation from the output of the ViT. Differently, we make use of the fundamental component -- attention mechanism, to generate masks for semantic segmentation. Specifically, we propose the Attention-to-Mask (ATM) module, in which the similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks. Experiments show that our proposed SegVit using the ATM module outperforms its counterparts using the plain ViT backbone on the ADE20K dataset and achieves new state-of-the-art performance on COCO-Stuff-10K and PASCAL-Context datasets. Furthermore, to reduce the computational cost of the ViT backbone, we propose query-based down-sampling (QD) and query-based up-sampling (QU) to build a Shrunk structure. With the proposed Shrunk structure, the model can save up to $40\%$ computations while maintaining competitive performance.

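MOTIVATION AND GOALS

Explore the potential of plain Vision Transformers (ViTs) for semantic segmentation.
Propose an Attention-to-Mask (ATM) module that derives masks from attention maps.
Cascade ATM across multiple ViT layers to fuse multi-layer information for segmentation.
Introduce a Shrunk backbone (query-based down-sampling and up-sampling) to reduce computation.
Demonstrate state-of-the-art or competitive results on ADE20K, COCO-Stuff-10K, and PASCAL-Context.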

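PROPOSED METHOD

Define learnable class tokens as queries and apply cross-attention against the backbone feature map; a sigmoid over the resulting similarity map yields one mask per class.
Compute class predictions by applying a linear transformation and softmax to the updated class tokens.
Fuse the ATM outputs from multiple ViT layers to form the final segmentation prediction.
Introduce Shrunk, a variant that saves computation through query-based down-sampling (QD) and query-based up-sampling (QU), reducing GFLOPs by up to about 40%.
Train with a multi-term loss, L_overall = L_cls + λ_focal L_focal + λ_dice L_dice, supervising the class tokens and masks at every layer.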
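
The attention-to-mask step described above can be illustrated with a minimal pure-Python sketch. It is a simplification under stated assumptions, not the authors' implementation: the learned query/key/value projections, multi-head attention, and the multi-layer cascade are omitted, and the names `atm_layer` and `classify`, the classifier weights, and the toy inputs are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(row):
    # Subtract the max for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def atm_layer(class_tokens, features):
    """One simplified ATM step (assumption: no learned projections).

    class_tokens: N x d list of class-token queries.
    features:     L x d list of flattened spatial features (L = H * W).
    Returns (masks, updated_tokens) with shapes N x L and N x d.
    """
    d = len(class_tokens[0])
    scale = 1.0 / math.sqrt(d)
    # Similarity map between every class token and every spatial position.
    sim = [[scale * sum(q * k for q, k in zip(tok, feat)) for feat in features]
           for tok in class_tokens]
    # Mask branch: element-wise sigmoid of the similarity map.
    masks = [[sigmoid(s) for s in row] for row in sim]
    # Token branch: softmax over positions, then a weighted sum of features.
    attn = [softmax(row) for row in sim]
    updated = [[sum(a * feat[j] for a, feat in zip(row, features))
                for j in range(d)] for row in attn]
    return masks, updated


def classify(tokens, weight):
    """Linear map (weight is d x C) followed by softmax over classes."""
    logits = [[sum(t * w for t, w in zip(tok, col)) for col in zip(*weight)]
              for tok in tokens]
    return [softmax(row) for row in logits]


# Toy example: 2 class tokens, 3 spatial positions, feature dimension 2.
tokens = [[1.0, 0.0], [0.0, 1.0]]
feats = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
masks, updated = atm_layer(tokens, feats)
probs = classify(updated, [[1.0, 0.0], [0.0, 1.0]])
```

In this toy input the first token aligns with the first spatial position, so `masks[0][0]` is larger than `masks[0][1]`, and the same similarity map drives both the mask and the token update. That sharing is the point of the sketch: the mask is read off the attention computation rather than decoded from per-pixel features.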

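EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a plain ViT backbone deliver effective dense semantic segmentation through attention-driven mask inference?
RQ2: Does using cross-attention similarity maps as masks improve segmentation quality over per-pixel decoding on ViT features?
RQ3: Can the multi-layer ATM cascade and the Shrunk backbone reduce the computation of ViT-based segmentation without sacrificing accuracy?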

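Key Findings

SegViT with ATM reaches 55.2% mIoU on ADE20K with a ViT-Large backbone, and 55.1% with the Shrunk variant, staying competitive while lowering cost.
On ADE20K, SegViT with ViT-Large outperforms several ViT-based methods and approaches or exceeds the state of the art in some settings.
SegViT-Shrunk cuts computational cost by about 40% (373.5 GFLOPs versus 637.9 GFLOPs) with only a small drop in accuracy.
Feeding multiple layers into ATM yields consistent mIoU gains (for example, up to +1.7% on ADE20K with three layers).
SegViT also shows strong results on PASCAL-Context (60 classes, 65.3% mIoU) and COCO-Stuff-10K (50.3% mIoU with ViT-Large).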
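
The "about 40%" saving can be checked directly from the two GFLOPs figures and the two ADE20K mIoU values quoted above. The helper name `relative_saving` is introduced here for illustration only.

```python
def relative_saving(baseline, reduced):
    """Fraction of the baseline cost that the reduced model saves."""
    return (baseline - reduced) / baseline


# Figures quoted in the findings above (ViT-Large on ADE20K).
saving = relative_saving(baseline=637.9, reduced=373.5)
miou_drop = round(55.2 - 55.1, 1)  # percentage points

print(f"GFLOPs saved: {saving:.1%}")      # GFLOPs saved: 41.4%
print(f"mIoU drop: {miou_drop} points")   # mIoU drop: 0.1 points
```

On these numbers the Shrunk variant gives up 0.1 mIoU points for roughly a 41% reduction in compute.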


This review was generated by AI and checked by human editors.