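[Paper Explained] SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation
SegNeXt introduces a CNN-based encoder built on a multi-scale convolutional attention (MSCA) mechanism for semantic segmentation, reaching state-of-the-art performance on major benchmarks at lower computational cost than transformer-based methods.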
We present SegNeXt, a simple convolutional network architecture for semantic segmentation. Recent transformer-based models have dominated the field of semantic segmentation due to the efficiency of self-attention in encoding spatial information. In this paper, we show that convolutional attention is a more efficient and effective way to encode contextual information than the self-attention mechanism in transformers. By re-examining the characteristics owned by successful segmentation models, we discover several key components leading to the performance improvement of segmentation models. This motivates us to design a novel convolutional attention network that uses cheap convolutional operations. Without bells and whistles, our SegNeXt significantly improves the performance of previous state-of-the-art methods on popular benchmarks, including ADE20K, Cityscapes, COCO-Stuff, Pascal VOC, Pascal Context, and iSAID. Notably, SegNeXt outperforms EfficientNet-L2 w/ NAS-FPN and achieves 90.6% mIoU on the Pascal VOC 2012 test leaderboard using only 1/10 parameters of it. On average, SegNeXt achieves about 2.0% mIoU improvements compared to the state-of-the-art methods on the ADE20K datasets with the same or fewer computations. Code is available at https://github.com/uyzhang/JSeg (Jittor) and https://github.com/Visual-Attention-Network/SegNeXt (Pytorch).
Motivation and Objectives
- Identify key properties that successful semantic segmentation models share.
- Propose a convolutional attention mechanism that is cheap to compute yet effective for context modeling.
- Design an encoder-decoder architecture that leverages multi-scale convolutional features for spatial attention.
- Demonstrate that convolutional attention can outperform transformer-based methods on standard benchmarks.
- Show favorable performance-computation trade-offs across diverse datasets.
Proposed Method
- Introduce MSCAN as the encoder with a multi-scale convolutional attention (MSCA) module.
- MSCA aggregates local context with a depth-wise convolution, captures multi-scale context with multi-branch large-kernel depth-wise convolutions, and then mixes channels through a 1x1 convolution whose output serves as the attention map.
- Use element-wise multiplication Att ⊗ F to apply attention to input features F.
- Adopt a lightweight Hamburger module to capture global context in the decoder by aggregating multi-level features.
- Employ an encoder-decoder design with four MSCAN stages and a decoder that uses multi-scale features from the last three stages.
- Pretrain the encoder on ImageNet, then train and evaluate segmentation on ADE20K, Cityscapes, COCO-Stuff, Pascal VOC, Pascal Context, and iSAID.
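The MSCA steps listed above can be sketched in a few lines of pure Python to make the data flow concrete. This is a minimal illustration rather than the authors' implementation: the averaging kernels stand in for learned weights, the 5x5 local kernel and the branch sizes 7, 11, and 21 are assumed here, each large kernel is approximated by a 1 x k strip followed by a k x 1 strip, and the helper names (`depthwise_conv`, `strip_branch`, `combine`, `pointwise_conv`, `msca`) are hypothetical.

```python
# Pure-Python sketch of the MSCA data flow; feature maps are feat[channel][row][col].
# All kernel weights below are placeholder averages, not learned parameters.

def depthwise_conv(feat, kernel):
    """Convolve every channel independently with one 2-D kernel (zero padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for ch in feat:
        h, w = len(ch), len(ch[0])
        new = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in range(kh):
                    for dj in range(kw):
                        y, x = i + di - kh // 2, j + dj - kw // 2
                        if 0 <= y < h and 0 <= x < w:  # outside the map counts as zero
                            acc += kernel[di][dj] * ch[y][x]
                new[i][j] = acc
        out.append(new)
    return out


def strip_branch(feat, k):
    """Cover a k x k receptive field with a 1 x k strip followed by a k x 1 strip."""
    row_kernel = [[1.0 / k] * k]
    col_kernel = [[1.0 / k] for _ in range(k)]
    return depthwise_conv(depthwise_conv(feat, row_kernel), col_kernel)


def combine(a, b, op):
    """Apply a binary op element-wise to two feature maps of equal shape."""
    return [[[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


def pointwise_conv(feat, weight):
    """1x1 convolution: weight[o][c] mixes input channel c into output channel o."""
    h, w = len(feat[0]), len(feat[0][0])
    return [[[sum(weight[o][c] * feat[c][i][j] for c in range(len(feat)))
              for j in range(w)] for i in range(h)] for o in range(len(weight))]


def msca(feat, mix, branch_sizes=(7, 11, 21)):
    """Return Att * F element-wise, with Att = 1x1 conv of (local context + strip branches)."""
    local = depthwise_conv(feat, [[1.0 / 25] * 5 for _ in range(5)])  # local context
    summed = local  # identity branch
    for k in branch_sizes:
        summed = combine(summed, strip_branch(local, k), lambda x, y: x + y)
    att = pointwise_conv(summed, mix)  # channel mixing produces the attention map
    return combine(att, feat, lambda x, y: x * y)


features = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
            [[0.5, 0.0, -1.0], [2.0, 1.0, 0.0]]]
identity_mix = [[1.0, 0.0], [0.0, 1.0]]  # square matrix so Att matches F channel-wise
out = msca(features, identity_mix)
print(len(out), len(out[0]), len(out[0][0]))  # same shape as the input
```

The strip pair is what keeps the operation cheap: a depth-wise k x k kernel needs k*k multiplications per pixel and channel, while the 1 x k and k x 1 pair needs only 2k. A practical version would use a deep learning framework's depth-wise convolution instead of the nested loops shown here.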
Experimental Results
Research Questions
- RQ1: Does a carefully designed convolutional attention mechanism with multi-scale receptive fields match or exceed transformer-based self-attention for semantic segmentation?
- RQ2: Can a CNN-based encoder with MSCA achieve favorable accuracy-FLOPs trade-offs on high-resolution segmentation tasks?
- RQ3: How does the proposed SegNeXt decoder (Hamburger-based global context) affect segmentation performance compared to other decoders?
- RQ4: What is the impact of multi-scale convolutions and channel-wise attention on segmentation benchmarks across diverse datasets?
Key Findings
- SegNeXt-S achieves strong results with around 13.9M parameters and significantly lower FLOPs than some transformer-based rivals on ADE20K and Cityscapes.
- SegNeXt-B and SegNeXt-L show substantial mIoU gains over several state-of-the-art methods at lower or comparable computational cost; the smaller SegNeXt-S likewise outperforms SegFormer-B2 with less computation.
- MSCA’s multi-branch large-kernel convolutions plus an attention weighting mechanism yield better segmentation performance than single large-kernel or non-attention designs.
- The Hamburger decoder provides a favorable accuracy-to-computation balance and outperforms several attention-based decoders in experiments.
- On Pascal VOC, SegNeXt-L reaches 90.6% mIoU under the pretraining and evaluation settings reported in the paper; in the Cityscapes real-time evaluation, SegNeXt-T runs at 25 FPS without special acceleration.
- Across ADE20K, Cityscapes, COCO-Stuff, Pascal VOC, Pascal Context, and iSAID, SegNeXt consistently improves over recent transformer-based and CNN-based methods.
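All of the figures above are reported in mIoU (mean intersection over union), so a small sketch of how that metric is computed may help when reading them. This is a simplified version under stated assumptions: labels are flat integer lists for a single image, predictions always fall in `range(num_classes)`, pixels whose ground truth equals a hypothetical `ignore_label` of 255 are skipped, and only classes that actually occur are averaged. Benchmark toolkits typically accumulate intersections and unions over the whole dataset before dividing.

```python
def mean_iou(pred, target, num_classes, ignore_label=255):
    """Mean IoU over classes that occur in pred or target (flat integer label lists)."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:  # assumed convention: skip unlabeled pixels
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # A wrong pixel enlarges the union of both the predicted and the true class.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
score = mean_iou(pred, target, num_classes=2)
print(round(score, 4))  # class 0: 1/2, class 1: 2/3, mean = 7/12 ≈ 0.5833
```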
This explanation was generated by AI and reviewed by human editors.