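[Paper Review] SSA-CNN: Semantic Self-Attention CNN for Pedestrian Detection
SSA-CNN integrates multi-scale semantic segmentation maps with CNN features as self-attention cues to improve pedestrian detection. On Caltech it keeps inference efficient and achieves a state-of-the-art miss rate (MR).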
Pedestrian detection plays an important role in many applications such as autonomous driving. We propose a method that explores semantic segmentation results as self-attention cues to significantly improve pedestrian detection performance. Specifically, a multi-task network is designed to jointly learn semantic segmentation and pedestrian detection from image datasets with weak box-wise annotations. The semantic segmentation feature maps are concatenated with the corresponding convolutional feature maps to provide more discriminative features for pedestrian detection and pedestrian classification. By jointly learning segmentation and detection, our proposed pedestrian self-attention mechanism can effectively identify pedestrian regions and suppress backgrounds. In addition, we propose to incorporate semantic attention information from multi-scale layers into a deep convolutional neural network to boost pedestrian detection. Experimental results show that the proposed method achieves the best detection performance, with an MR of 6.27% on the Caltech dataset, and obtains competitive performance on the CityPersons dataset while maintaining high computational efficiency.
Motivation and Objectives
- Improve pedestrian detection by using semantic segmentation results as self-attention cues.
- Propose a multi-scale, multi-task framework to jointly learn pedestrian detection and semantic segmentation with box-wise annotations.
- Integrate semantic features into RPN and R-CNN stages to improve discrimination and localization of pedestrians.
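To make the idea of box-wise supervision concrete, the following is a minimal sketch of how bounding-box annotations could be turned into a weak binary segmentation target. The function name `boxes_to_mask`, the `(x1, y1, x2, y2)` box format with exclusive upper bounds, and the rule of filling the whole box are assumptions made for illustration; the review does not state how the paper builds its targets.

```python
def boxes_to_mask(height, width, boxes):
    """Rasterize pedestrian boxes into a binary H x W mask (1 = pedestrian).

    boxes: iterable of (x1, y1, x2, y2) in pixel coordinates, with x2 and y2
    exclusive. Filling the whole box is an assumption for illustration only.
    """
    mask = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        # Clip the box to the image so out-of-range boxes do not raise.
        x1, y1 = max(0, x1), max(0, y1)
        x2, y2 = min(width, x2), min(height, y2)
        for y in range(y1, y2):
            for x in range(x1, x2):
                mask[y][x] = 1
    return mask


# Toy example: one pedestrian box inside a 4 x 6 image.
mask = boxes_to_mask(4, 6, [(1, 1, 3, 4)])
for row in mask:
    print(row)
```

Every pixel inside a box is labeled as pedestrian, so the target is noisy near box corners. That is the trade-off of avoiding pixel-wise labels.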
Proposed Method
- Extend Faster R-CNN with Semantic Self-Attention RPN (SSA-RPN) and Semantic Self-Attention R-CNN (SSA-RCNN).
- Attach semantic segmentation branches to conv4_3 and conv5_3 to produce conv4_3_seg and conv5_3_seg feature maps.
- Concatenate semantic feature maps with corresponding convolution features to form augmented detection/classification features.
- Use multi-scale semantic information by pooling and combining segmentation maps from conv4_3 and conv5_3 for self-attention in R-CNN.
- Train with a multi-task loss that jointly optimizes detection and segmentation branches (binary pedestrian vs. non-pedestrian).
- Evaluate on Caltech and CityPersons with single-image inference on a GTX 1080 Ti.
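The steps above can be sketched in plain Python, with feature maps as nested lists shaped `[C][H][W]`. This is only an illustration under stated assumptions, not the authors' implementation: the helper names, the 2x2 max pooling used to align `conv4_3_seg` with the `conv5_3` resolution, the binary cross-entropy segmentation loss, and the `seg_weight` parameter are all hypothetical choices that the review does not confirm.

```python
import math


def max_pool_2x2(feat):
    """2x2 max pooling with stride 2 over a [C][H][W] map (H and W even)."""
    pooled = []
    for channel in feat:
        rows = []
        for y in range(0, len(channel) - 1, 2):
            rows.append([
                max(channel[y][x], channel[y][x + 1],
                    channel[y + 1][x], channel[y + 1][x + 1])
                for x in range(0, len(channel[0]) - 1, 2)
            ])
        pooled.append(rows)
    return pooled


def concat_channels(*feats):
    """Concatenate [C][H][W] maps along the channel axis."""
    h, w = len(feats[0][0]), len(feats[0][0][0])
    fused = []
    for feat in feats:
        for channel in feat:
            if len(channel) != h or len(channel[0]) != w:
                raise ValueError("spatial sizes must match before concatenation")
            fused.append(channel)
    return fused


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean per-pixel BCE between predicted probabilities and a 0/1 mask."""
    total, count = 0.0, 0
    for p_row, t_row in zip(probs, targets):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count


def multi_task_loss(det_loss, seg_loss, seg_weight=1.0):
    """Weighted sum of detection and segmentation losses (weight is assumed)."""
    return det_loss + seg_weight * seg_loss


# Toy tensors: a 1-channel 4x4 conv4_3_seg map, a 2-channel 2x2 conv5_3 map,
# and a 1-channel 2x2 conv5_3_seg map of pedestrian probabilities.
conv4_3_seg = [[[0.1, 0.2, 0.0, 0.0],
                [0.3, 0.9, 0.0, 0.1],
                [0.0, 0.0, 0.2, 0.1],
                [0.0, 0.1, 0.0, 0.3]]]
conv5_3 = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]]]
conv5_3_seg = [[[0.9, 0.1], [0.8, 0.2]]]

pooled = max_pool_2x2(conv4_3_seg)                     # align to 2x2
fused = concat_channels(conv5_3, conv5_3_seg, pooled)  # 2 + 1 + 1 channels

target = [[1, 0], [1, 0]]                              # box-derived weak mask
seg_loss = binary_cross_entropy(conv5_3_seg[0], target)
total = multi_task_loss(det_loss=0.5, seg_loss=seg_loss)
print(len(fused), round(seg_loss, 4), round(total, 4))
```

In a real detector these would be framework tensors and the detection loss would combine classification and box regression terms. The `det_loss=0.5` value here is a placeholder.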
Experimental Results
Research Questions
- RQ1: Does incorporating multi-scale semantic segmentation as self-attention improve pedestrian detection performance?
- RQ2: Can joint learning of detection and segmentation using box-wise annotations reduce annotation burden while boosting accuracy?
- RQ3: How does multi-scale semantic self-attention affect RPN proposals and R-CNN classification in pedestrian detection?
- RQ4: What is the method's runtime efficiency compared to state-of-the-art approaches?
Key Findings
| Method | MR (%) | Runtime | GPU |
|---|---|---|---|
| CompACT-Deep | 11.75 | 1s | K40 |
| MS-CNN | 9.95 | 0.4s | Titan X |
| SA-FastRCNN | 9.68 | 0.59s | Titan X |
| RPN+BF | 9.58 | 0.6s | K40 |
| F-DNN* | 8.65 | 0.3s | Titan X |
| GDFL | 7.85 | 0.05s | Titan X |
| F-DNN2+SS* | 7.67 | 2.48s | Titan X |
| SDS-RCNN | 7.36 | 0.21s | Titan X |
| SSA-CNN | 6.27 | 0.11s | 1080 Ti |
- SSA-CNN achieves an MR of 6.27% on the Caltech test set under the Reasonable setting, the lowest among the compared methods.
- Demonstrates competitive results on CityPersons while maintaining high computational efficiency.
- Multi-scale semantic self-attention improves both proposal quality (SSA-RPN) and classification (SSA-RCNN) compared to single-scale or no-attention baselines.
- Using box-wise annotations for semantic guidance reduces annotation requirements relative to pixel-wise segmentation.
- At 0.11 s per image, the combined SSA-RPN and SSA-RCNN pipeline runs faster than SDS-RCNN (0.21 s) and F-DNN2+SS (2.48 s), although the timings were measured on different GPUs.
- Ablation studies show deeper conv5_3 semantic maps provide stronger attention cues and that multi-scale fusion yields best performance.
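For readers unfamiliar with the MR column, the sketch below assumes it refers to the log-average miss rate commonly reported on the Caltech benchmark: the miss rate is sampled at nine false-positives-per-image (FPPI) values evenly spaced in log space between 10^-2 and 10^0, then averaged geometrically. The function name, the handling of reference points below the curve's first FPPI value, and the toy curve are assumptions for illustration. The official evaluation code may differ in detail.

```python
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of miss rates sampled at log-spaced FPPI references.

    fppi and miss_rate describe one curve, sorted by increasing FPPI.
    If the curve has no point at or below a reference, a miss rate of 1.0
    is used (an assumption of this sketch).
    """
    refs = [10 ** (-2 + 2 * i / (num_points - 1)) for i in range(num_points)]
    sampled = []
    for ref in refs:
        mr = 1.0
        for f, m in zip(fppi, miss_rate):
            if f <= ref:
                mr = m  # keep the last curve point at or below the reference
            else:
                break
        sampled.append(max(mr, 1e-10))  # guard against log(0)
    return math.exp(sum(math.log(m) for m in sampled) / len(sampled))


# Toy curve (made-up numbers, not results from the paper).
toy_fppi = [0.005, 0.02, 0.1, 0.5, 1.0]
toy_miss = [0.30, 0.20, 0.10, 0.05, 0.04]
print(log_average_miss_rate(toy_fppi, toy_miss))
```

Because the average is taken in log space, improvements at low FPPI weigh as much as those at high FPPI, which is why small absolute differences in MR between methods can still be meaningful.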
This review was generated by AI and checked by a human editor.