QUICK REVIEW

[Paper Review] SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation

Meng-Hao Guo, Cheng-Ze Lu|arXiv (Cornell University)|Sep 18, 2022

Advanced Neural Network Applications|Cited by 483

One-Sentence Summary

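SegNeXt introduces a CNN-based encoder built on a multi-scale convolutional attention (MSCA) mechanism for semantic segmentation, reaching state-of-the-art performance on major benchmarks at lower computational cost than transformer-based methods.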

ABSTRACT

We present SegNeXt, a simple convolutional network architecture for semantic segmentation. Recent transformer-based models have dominated the field of semantic segmentation due to the efficiency of self-attention in encoding spatial information. In this paper, we show that convolutional attention is a more efficient and effective way to encode contextual information than the self-attention mechanism in transformers. By re-examining the characteristics owned by successful segmentation models, we discover several key components leading to the performance improvement of segmentation models. This motivates us to design a novel convolutional attention network that uses cheap convolutional operations. Without bells and whistles, our SegNeXt significantly improves the performance of previous state-of-the-art methods on popular benchmarks, including ADE20K, Cityscapes, COCO-Stuff, Pascal VOC, Pascal Context, and iSAID. Notably, SegNeXt outperforms EfficientNet-L2 w/ NAS-FPN and achieves 90.6% mIoU on the Pascal VOC 2012 test leaderboard using only 1/10 parameters of it. On average, SegNeXt achieves about 2.0% mIoU improvements compared to the state-of-the-art methods on the ADE20K datasets with the same or fewer computations. Code is available at https://github.com/uyzhang/JSeg (Jittor) and https://github.com/Visual-Attention-Network/SegNeXt (Pytorch).

Motivation and Objectives

Identify key properties that successful semantic segmentation models share.
Propose a convolutional attention mechanism that is cheap to compute yet effective for context modeling.
Design an encoder-decoder architecture that leverages multi-scale convolutional features for spatial attention.
Demonstrate that convolutional attention can outperform transformer-based methods on standard benchmarks.
Show favorable performance-computation trade-offs across diverse datasets.

Proposed Method

Introduce MSCAN as the encoder with a multi-scale convolutional attention (MSCA) module.
MSCA aggregates local context with a depth-wise convolution, captures multi-scale context with multi-branch large-kernel depth-wise convolutions, and then mixes channels with a 1x1 convolution whose output Att serves as the attention map.
Use element-wise multiplication Att ⊗ F to apply attention to input features F.
Adopt a lightweight Hamburger module to capture global context in the decoder by aggregating multi-level features.
Employ an encoder-decoder design with four MSCAN stages and a decoder that uses multi-scale features from the last three stages.
Pretrain the encoder on ImageNet, then train and evaluate segmentation on ADE20K, Cityscapes, COCO-Stuff, Pascal VOC, Pascal Context, and iSAID.
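
The MSCA steps listed above can be sketched as a NumPy forward pass. This is a minimal illustration with random, untrained weights, not the authors' implementation. The 5x5 local kernel, the strip-convolution branches (a 1xk followed by a kx1 depth-wise convolution for k = 7, 11, 21), and the identity branch are assumptions drawn from the SegNeXt paper's description rather than from this summary; biases, normalization, and the rest of the MSCAN block are omitted. The helper names `depthwise_conv`, `random_weights`, and `msca` are hypothetical.

```python
import numpy as np


def depthwise_conv(x, kernel):
    """Depth-wise 2D cross-correlation with zero 'same' padding.

    x: (C, H, W) feature map; kernel: (C, kh, kw), one filter per channel.
    Assumes odd kernel sizes so the output keeps the input resolution.
    """
    _, h, w = x.shape
    _, kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(kh):
        for j in range(kw):
            out += kernel[:, i, j][:, None, None] * padded[:, i:i + h, j:j + w]
    return out


def random_weights(channels, rng, strip_sizes=(7, 11, 21), local_size=5):
    """Random (untrained) weights, useful only for checking shapes."""
    scale = 0.1
    return {
        "local": scale * rng.standard_normal((channels, local_size, local_size)),
        "strip": [
            (scale * rng.standard_normal((channels, 1, k)),   # 1 x k strip
             scale * rng.standard_normal((channels, k, 1)))   # k x 1 strip
            for k in strip_sizes
        ],
        "pointwise": scale * rng.standard_normal((channels, channels)),
    }


def msca(feat, weights):
    """Att = Conv1x1(sum of branches); output = Att * feat (element-wise)."""
    base = depthwise_conv(feat, weights["local"])   # local context
    mixed = base.copy()                             # identity branch
    for row_kernel, col_kernel in weights["strip"]:
        # Each branch approximates a large k x k kernel with two strips.
        mixed += depthwise_conv(depthwise_conv(base, row_kernel), col_kernel)
    # 1x1 convolution: a linear mix across channels at every pixel.
    att = np.einsum("oc,chw->ohw", weights["pointwise"], mixed)
    return att * feat                               # Att ⊗ F


rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 16))             # hypothetical (C, H, W)
out = msca(feat, random_weights(8, rng))
print(out.shape)  # (8, 16, 16)
```

Note that the attention map here is used directly as a multiplicative weight, with no softmax or sigmoid; that matches the Att ⊗ F formulation above, but any normalization choice in a real model should be checked against the official code.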

Experimental Results

Research Questions

RQ1: Does a carefully designed convolutional attention mechanism with multi-scale receptive fields match or exceed transformer-based self-attention for semantic segmentation?
RQ2: Can a CNN-based encoder with MSCA achieve favorable accuracy-FLOPs trade-offs on high-resolution segmentation tasks?
RQ3: How does the proposed SegNeXt decoder (Hamburger-based global context) affect segmentation performance compared to other decoders?
RQ4: What is the impact of multi-scale convolutions and channel-wise attention on segmentation benchmarks across diverse datasets?
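
The accuracy-FLOPs question depends on how cheaply large kernels can be applied. As a rough worked example, the snippet below counts multiply-accumulates (MACs) for a full k x k depth-wise convolution versus a pair of 1 x k and k x 1 depth-wise strip convolutions. The strip-pair design and the kernel sizes 7, 11, 21 are assumptions taken from the SegNeXt paper rather than from this summary, and the feature-map size is hypothetical.

```python
def depthwise_macs(channels, height, width, kh, kw):
    """MACs for a stride-1, same-padded depth-wise conv (one filter per channel)."""
    return channels * height * width * kh * kw


def full_vs_strip(channels, height, width, k):
    """Return (MACs of a k x k kernel, MACs of a 1 x k plus k x 1 strip pair)."""
    full = depthwise_macs(channels, height, width, k, k)
    strip = (depthwise_macs(channels, height, width, 1, k)
             + depthwise_macs(channels, height, width, k, 1))
    return full, strip


channels, height, width = 64, 128, 128   # hypothetical feature-map size
for k in (7, 11, 21):
    full, strip = full_vs_strip(channels, height, width, k)
    print(f"k={k}: full {full:,} MACs, strip pair {strip:,} MACs, "
          f"ratio {full / strip:.1f}x")
```

The ratio is k²/(2k) = k/2, so the saving grows with kernel size: 3.5x, 5.5x, and 10.5x for k = 7, 11, and 21. A strip pair is not equivalent to a full k x k kernel (it can only express separable filters), so this is a cost comparison, not a claim of equal expressiveness.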

Key Findings

SegNeXt-S achieves strong results with around 13.9M parameters and significantly lower FLOPs than some transformer-based rivals on ADE20K and Cityscapes.
SegNeXt-B and SegNeXt-L show substantial mIoU gains over several state-of-the-art methods while maintaining lower or comparable computation; even the smaller SegNeXt-S outperforms SegFormer-B2 with less computation.
MSCA’s multi-branch large-kernel convolutions plus an attention weighting mechanism yield better segmentation performance than single large-kernel or non-attention designs.
The Hamburger decoder provides a favorable accuracy-to-computation balance and outperforms several attention-based decoders in experiments.
On Pascal VOC, SegNeXt-L reaches 90.6% mIoU under specific pretraining and evaluation settings. In the Cityscapes real-time evaluation, SegNeXt-T runs at 25 FPS without special acceleration.
Across ADE20K, Cityscapes, COCO-Stuff, Pascal VOC, Pascal Context, and iSAID, SegNeXt consistently improves over recent transformer-based and CNN-based methods.
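
The findings above are reported in mIoU (mean intersection over union). For readers unfamiliar with the metric, here is a minimal pure-Python sketch of how it is typically computed from flat lists of predicted and ground-truth labels. The function name `mean_iou` and the `ignore_index=255` convention are assumptions for illustration; benchmark toolkits differ in details such as how absent classes and ignored pixels are handled.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    pred, target: equal-length sequences of integer class labels.
    Pixels whose ground-truth label equals ignore_index are skipped.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    if not ious:
        raise ValueError("no valid pixels to evaluate")
    return sum(ious) / len(ious)


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
# Per-class IoU: class 0 = 1/3, class 1 = 2/3, class 2 = 1/2.
print(f"{mean_iou(pred, target, num_classes=3):.3f}")  # 0.500
```

Because every class contributes equally to the average, a 2-point mIoU gain can reflect large improvements on a few rare classes rather than a uniform improvement across all pixels.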

This review was generated by AI and reviewed by human editors.