[Paper Review] SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation
SegNeXt introduces a multi-scale convolutional attention mechanism (MSCA) into a CNN-based encoder for semantic segmentation. On major benchmarks it achieves state-of-the-art performance at lower computational cost than transformer-based methods.
We present SegNeXt, a simple convolutional network architecture for semantic segmentation. Recent transformer-based models have dominated the field of semantic segmentation due to the efficiency of self-attention in encoding spatial information. In this paper, we show that convolutional attention is a more efficient and effective way to encode contextual information than the self-attention mechanism in transformers. By re-examining the characteristics owned by successful segmentation models, we discover several key components leading to the performance improvement of segmentation models. This motivates us to design a novel convolutional attention network that uses cheap convolutional operations. Without bells and whistles, our SegNeXt significantly improves the performance of previous state-of-the-art methods on popular benchmarks, including ADE20K, Cityscapes, COCO-Stuff, Pascal VOC, Pascal Context, and iSAID. Notably, SegNeXt outperforms EfficientNet-L2 w/ NAS-FPN and achieves 90.6% mIoU on the Pascal VOC 2012 test leaderboard using only 1/10 parameters of it. On average, SegNeXt achieves about 2.0% mIoU improvements compared to the state-of-the-art methods on the ADE20K datasets with the same or fewer computations. Code is available at https://github.com/uyzhang/JSeg (Jittor) and https://github.com/Visual-Attention-Network/SegNeXt (Pytorch).
Research motivation and objectives
- Identify key properties that successful semantic segmentation models share.
- Propose a convolutional attention mechanism that is cheap to compute yet effective for context modeling.
- Design an encoder-decoder architecture that leverages multi-scale convolutional features for spatial attention.
- Demonstrate that convolutional attention can outperform transformer-based methods on standard benchmarks.
- Show favorable performance-computation trade-offs across diverse datasets.
Proposed method
- Introduce MSCAN as the encoder with a multi-scale convolutional attention (MSCA) module.
- MSCA aggregates local context with a depth-wise convolution, captures multi-scale context with multi-branch large-kernel depth-wise convolutions, then mixes channels through a 1×1 convolution to produce the attention map Att.
- Use element-wise multiplication Att ⊗ F to apply attention to input features F.
- Adopt a lightweight Hamburger module to capture global context in the decoder by aggregating multi-level features.
- Employ an encoder-decoder design with four MSCAN stages and a decoder that uses multi-scale features from the last three stages.
- Pretrain the encoder on ImageNet, then train and evaluate segmentation on ADE20K, Cityscapes, COCO-Stuff, Pascal VOC, Pascal Context, and iSAID.
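The MSCA steps listed above can be sketched without any dependencies. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the 5×5 local kernel and the strip sizes 7, 11, and 21 follow our reading of the paper rather than anything stated in this review. Feature maps are plain nested lists shaped [channels][height][width]. A practical version would use a deep learning framework's grouped convolution layers instead.

```python
import operator


def depthwise_conv(x, kernels):
    """Zero-padded 'same' depth-wise convolution (cross-correlation form):
    each channel is filtered by its own 2D kernel, with no channel mixing."""
    out = []
    for chan, k in zip(x, kernels):
        h, w = len(chan), len(chan[0])
        kh, kw = len(k), len(k[0])
        res = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for a in range(kh):
                    for b in range(kw):
                        y, z = i + a - kh // 2, j + b - kw // 2
                        if 0 <= y < h and 0 <= z < w:
                            acc += k[a][b] * chan[y][z]
                res[i][j] = acc
        out.append(res)
    return out


def pointwise_conv(x, weight):
    """1x1 convolution: mixes channels at every pixel; weight is [C_out][C_in]."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(wt * x[c][i][j] for c, wt in enumerate(row)) for j in range(w)]
         for i in range(h)]
        for row in weight
    ]


def elementwise(op, x, y):
    """Apply a binary operator to two feature maps of identical shape."""
    return [[[op(p, q) for p, q in zip(rx, ry)] for rx, ry in zip(cx, cy)]
            for cx, cy in zip(x, y)]


def msca(x, base_kernels, strip_kernels, mix_weight):
    """Multi-scale convolutional attention sketch.

    base_kernels:  per-channel kernels for local aggregation (assumed 5x5).
    strip_kernels: list of (row_kernels, col_kernels) pairs, one per branch;
                   a 1xk pass followed by a kx1 pass stands in for a kxk kernel.
    mix_weight:    [C][C] weights of the 1x1 channel-mixing convolution.
    """
    base = depthwise_conv(x, base_kernels)
    total = base  # identity branch
    for row_k, col_k in strip_kernels:
        branch = depthwise_conv(depthwise_conv(base, row_k), col_k)
        total = elementwise(operator.add, total, branch)
    att = pointwise_conv(total, mix_weight)
    return elementwise(operator.mul, att, x)  # Att ⊗ F


def delta_kernel(kh, kw):
    """Kernel that passes its input through unchanged (handy for checking)."""
    k = [[0.0] * kw for _ in range(kh)]
    k[kh // 2][kw // 2] = 1.0
    return k


# Sanity check with pass-through kernels: four branches sum to 4*F,
# so the output should equal 4 * F * F element-wise.
channels = 2
x = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.0], [-1.0, 2.0]]]
base_k = [delta_kernel(5, 5) for _ in range(channels)]
strips = [([delta_kernel(1, k)] * channels, [delta_kernel(k, 1)] * channels)
          for k in (7, 11, 21)]
identity_mix = [[1.0, 0.0], [0.0, 1.0]]
out = msca(x, base_k, strips, identity_mix)
```

Splitting each large kernel into a 1×k and a k×1 pass keeps the per-channel weight count at 2k instead of k², which is one way a design like this can stay cheap while still covering a wide receptive field.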
Experimental results
Research questions
- RQ1: Does a carefully designed convolutional attention mechanism with multi-scale receptive fields match or exceed transformer-based self-attention for semantic segmentation?
- RQ2: Can a CNN-based encoder with MSCA achieve favorable accuracy-FLOPs trade-offs on high-resolution segmentation tasks?
- RQ3: How does the proposed SegNeXt decoder (Hamburger-based global context) affect segmentation performance compared to other decoders?
- RQ4: What is the impact of multi-scale convolutions and channel-wise attention on segmentation benchmarks across diverse datasets?
Key findings
- SegNeXt-S achieves strong results with around 13.9M parameters and significantly lower FLOPs than some transformer-based rivals on ADE20K and Cityscapes.
- SegNeXt-B and SegNeXt-L show substantial mIoU gains over several state-of-the-art methods while maintaining lower or comparable computation; the smaller SegNeXt-S also outperforms SegFormer-B2 with less computation.
- MSCA’s multi-branch large-kernel convolutions plus an attention weighting mechanism yield better segmentation performance than single large-kernel or non-attention designs.
- The Hamburger decoder provides a favorable accuracy-to-computation balance and outperforms several attention-based decoders in experiments.
- On the Pascal VOC 2012 test leaderboard, SegNeXt-L reaches 90.6% mIoU; in the Cityscapes real-time evaluation, SegNeXt-T runs at 25 FPS without special acceleration.
- Across ADE20K, Cityscapes, COCO-Stuff, Pascal VOC, Pascal Context, and iSAID, SegNeXt consistently improves over recent transformer-based and CNN-based methods.
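The findings above are all reported in mIoU (mean intersection over union). As a reminder of what that number measures, here is a minimal sketch over flat label sequences. The function name, the `ignore_index=255` default, and the choice to skip classes absent from both prediction and ground truth are assumptions for illustration; benchmark toolkits typically accumulate these counts over the whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean of per-class IoU = |pred ∩ target| / |pred ∪ target|."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:  # unlabeled pixel: excluded from the metric
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # The pixel belongs to the predicted set of class p
            # and to the ground-truth set of class t.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)  # per-class IoU: 1/3, 2/3, 1/2
```

Because every class contributes equally to the average, a model cannot score well by getting only the frequent classes right.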
This review was generated by AI and checked by a human editor.