[Paper Review] PP-MobileSeg: Explore the Fast and Accurate Semantic Segmentation Model on Mobile Devices
PP-MobileSeg introduces StrideFormer, the Aggregated Attention Module (AAM), and the Valid Interpolate Module (VIM), achieving state-of-the-art mobile semantic segmentation while balancing accuracy, model size, and latency on ARM devices.
The success of transformers in computer vision has led to several attempts to adapt them for mobile devices, but their performance remains unsatisfactory in some real-world applications. To address this issue, we propose PP-MobileSeg, a semantic segmentation model that achieves state-of-the-art performance on mobile devices. PP-MobileSeg comprises three novel parts: the StrideFormer backbone, the Aggregated Attention Module (AAM), and the Valid Interpolate Module (VIM). The four-stage StrideFormer backbone is built with MV3 blocks and strided SEA attention, and it is able to extract rich semantic and detailed features with minimal parameter overhead. The AAM first filters the detailed features through semantic feature ensemble voting and then combines them with semantic features to enhance the semantic information. Furthermore, we proposed VIM to upsample the downsampled feature to the resolution of the input image. It significantly reduces model latency by only interpolating classes present in the final prediction, which is the most significant contributor to overall model latency. Extensive experiments show that PP-MobileSeg achieves a superior tradeoff between accuracy, model size, and latency compared to other methods. On the ADE20K dataset, PP-MobileSeg achieves 1.57% higher accuracy in mIoU than SeaFormer-Base with 32.9% fewer parameters and 42.3% faster acceleration on Qualcomm Snapdragon 855. Source codes are available at https://github.com/PaddlePaddle/PaddleSeg/tree/release/2.8.
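The AAM description in the abstract (filter the detail features by a vote among semantic features, then add semantic information back) can be read as a gating operation. The following is only a minimal sketch under stated assumptions, not the authors' implementation: it assumes all feature maps already share one shape, that the "vote" is an average of sigmoid-activated semantic maps, and that the last semantic map is the one added back. The function names `sigmoid` and `aggregated_attention` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def aggregated_attention(detail, semantics):
    """Gate a detail feature map with a vote from semantic feature maps.

    detail:    array of shape (C, H, W) from a shallow stage.
    semantics: list of (C, H, W) arrays from deeper stages, ASSUMED to be
               already projected and upsampled to the detail resolution.
    """
    # Assumption: each semantic map casts a vote in (0, 1); votes are averaged.
    votes = np.mean([sigmoid(s) for s in semantics], axis=0)
    # Filter the detail features with the ensemble vote.
    filtered = detail * votes
    # Assumption: the final (deepest) semantic map is added back.
    return filtered + semantics[-1]
```

With two semantic maps and one detail map of shape `(C, H, W)`, the call `aggregated_attention(detail, [sem_a, sem_b])` returns an array of the same shape. The real module also has to align channel counts and resolutions, which this sketch leaves out.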
Research Motivation and Objectives
- Address the unsatisfactory performance of transformer-based segmentation models on resource-constrained mobile devices.
- Design a mobile-friendly architecture that balances model size, latency, and accuracy.
- Develop modules that efficiently fuse features and reduce inference latency on mobile hardware.
Proposed Method
- Four-stage StrideFormer backbone using MobileNetV3 blocks with strided SEA attention applied to the last two stages.
- Aggregated Attention Module (AAM) that filters detail features through ensemble voting of semantic-rich features, then fuses them with the semantic features.
- Valid Interpolate Module (VIM) to replace final interpolation and ArgMax by upsampling only the channels corresponding to predicted classes.
- Two architecture variants (PP-MobileSeg-Base and PP-MobileSeg-Tiny) with different channel widths and SEA heads to suit varying complexity needs.
- Training uses ImageNet1K pretraining, cross-entropy and Lovasz losses with a 4:1 ratio, AdamW optimizer, and data augmentation aligned with prior mobile segmentation work.
- Inference latency profiling on Qualcomm Snapdragon 855 with PaddleLite, with VIM enabled for datasets with many classes.
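The VIM idea in the list above (upsample only the channels of classes that actually appear in the prediction, instead of all class channels before ArgMax) can be sketched as follows. This is an illustration under assumptions rather than the PaddleSeg code: it uses nearest-neighbour upsampling to stay dependency-free, since this review does not state the interpolation mode, and the function names are hypothetical.

```python
import numpy as np

def nearest_upsample(logits, scale):
    """Nearest-neighbour upsampling of a (C, H, W) array by an integer factor."""
    return logits.repeat(scale, axis=1).repeat(scale, axis=2)

def baseline_predict(logits, scale):
    """Reference path: upsample every class channel, then ArgMax over classes."""
    return nearest_upsample(logits, scale).argmax(axis=0)

def valid_interpolate_predict(logits, scale):
    """VIM-style path: upsample only channels of classes present at low resolution."""
    low_res_pred = logits.argmax(axis=0)         # (H, W) class ids
    valid = np.unique(low_res_pred)              # sorted ids of classes present
    up = nearest_upsample(logits[valid], scale)  # (K, H*scale, W*scale), K <= C
    return valid[up.argmax(axis=0)]              # map subset indices to class ids
```

With nearest-neighbour upsampling both paths return the same label map, while the second one interpolates K channels instead of C. With a smoothing interpolation such as bilinear, the two paths may disagree at a few boundary pixels. The saving grows with the number of classes, which is consistent with enabling VIM on datasets with many classes.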
Experimental Results
Research Questions
- RQ1: Can a hybrid CNN-transformer backbone achieve superior mobile segmentation accuracy with limited parameters?
- RQ2: How much latency can be reduced on mobile devices by replacing the final interpolation/ArgMax with a class-aware upsampling module?
- RQ3: Do the proposed StrideFormer, AAM, and VIM modules synergistically improve accuracy and latency on the ADE20K and Cityscapes datasets?
Key Findings
| Dataset | Model | Backbone | mIoU (%) | Latency (ms) | Parameters (M) |
|---|---|---|---|---|---|
| Cityscapes | SeaFormer-Small | SeaFormer-Small | 70.70 | 204.9 | 1.61 |
| Cityscapes | PP-MobileSeg-Tiny | StrideFormer-Tiny | 70.82 | 158.3 | 1.44 |
| Cityscapes | SeaFormer-Base | SeaFormer-Base | 72.20 | 297.3 | 8.64 |
| Cityscapes | PP-MobileSeg-Base | StrideFormer-Base | 74.14 | 323.7 | 5.71 |
| ADE20K | TopFormer-Tiny | TopTransformer-Tiny | 32.46 | 490.3 | 1.41 |
| ADE20K | LR-ASPP | MobileNetV3-large-x1 | 33.10 | 730.9 | 3.20 |
| ADE20K | MobileSeg | MobileNetV3-large-x1 | 33.26 | 391.5 | 2.85 |
| ADE20K | TopFormer-Base | TopTransformer-Base | 37.80 | 480.6 | 5.13 |
| ADE20K | SeaFormer-Base | SeaFormer-Base | 40.20 | 465.4 | 8.64 |
| ADE20K | PP-MobileSeg-Tiny | StrideFormer-Tiny | 36.39 | 215.3 | 1.44 |
| ADE20K | PP-MobileSeg-Base | StrideFormer-Base | 41.57 | 265.5 | 5.71 |
- On ADE20K, PP-MobileSeg-Tiny achieves 36.39 mIoU with 215.3 ms latency and 1.44 M parameters, giving higher mIoU and lower latency than TopFormer-Tiny, LR-ASPP, and MobileSeg.
- On ADE20K, PP-MobileSeg-Base achieves 41.57 mIoU with 265.5 ms latency and 5.71 M parameters, exceeding SeaFormer-Base (40.20 mIoU, 465.4 ms, 8.64 M) in accuracy while being smaller and faster.
- On Cityscapes, PP-MobileSeg-Tiny reaches 70.82 mIoU with 158.3 ms latency and 1.44 M parameters, improving on SeaFormer-Small (70.70 mIoU, 204.9 ms, 1.61 M) in all three metrics.
- On Cityscapes, PP-MobileSeg-Base attains 74.14 mIoU with 323.7 ms latency and 5.71 M parameters: 1.94 points higher mIoU than SeaFormer-Base (72.20) with fewer parameters (5.71 M vs. 8.64 M), at somewhat higher latency (323.7 ms vs. 297.3 ms).
- Ablation studies show VIM reduces latency by about 49.5% and that both ensemble voting and final semantics in AAM contribute to accuracy gains.
- The four-stage StrideFormer backbone reduces parameter overhead by about 32.19% and improves accuracy by about 0.78% compared to the alternative backbone configurations.
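To make the comparisons in the findings easier to check, the relative differences can be recomputed from the ADE20K table rows for SeaFormer-Base and PP-MobileSeg-Base. The helper `relative_reduction` and the dictionary names are hypothetical, and the inputs are only the values listed in the table of this review. The results land close to, but not exactly on, the percentages quoted in the abstract, which may rest on different measurements or rounding.

```python
def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline

# ADE20K rows copied from the table in this review.
seaformer_base = {"miou": 40.20, "latency_ms": 465.4, "params_m": 8.64}
pp_mobileseg_base = {"miou": 41.57, "latency_ms": 265.5, "params_m": 5.71}

miou_gain = pp_mobileseg_base["miou"] - seaformer_base["miou"]
latency_cut = relative_reduction(seaformer_base["latency_ms"],
                                 pp_mobileseg_base["latency_ms"])
param_cut = relative_reduction(seaformer_base["params_m"],
                               pp_mobileseg_base["params_m"])

print(f"mIoU gain: {miou_gain:.2f} points")
print(f"Latency reduction: {latency_cut:.2f}%")
print(f"Parameter reduction: {param_cut:.2f}%")
```

The same helper applies to any other pair of rows, for example the Cityscapes rows for SeaFormer-Small and PP-MobileSeg-Tiny.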
This review was generated by AI and checked by a human editor.