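[Paper Review] Dual Path Multi-Scale Fusion Networks with Attention for Crowd Counting
SFANet combines a dual-path multi-scale fusion network with an attention mechanism to produce high-resolution density maps and accurate crowd counts across a wide range of densities. It uses a VGG16-bn backbone and trains its two integrated paths (a density map path and an attention map path) end to end.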
The task of crowd counting in varying density scenes is an extremely difficult challenge due to large scale variations. In this paper, we propose a novel dual path multi-scale fusion network architecture with attention mechanism named SFANet that can perform accurate count estimation as well as present high-resolution density maps for highly congested crowd scenes. The proposed SFANet contains two main components: a VGG backbone convolutional neural network (CNN) as the front-end feature map extractor and a dual path multi-scale fusion network as the back-end to generate the density map. The two paths have the same structure: one path is responsible for generating the attention map by highlighting crowd regions in images, and the other path is responsible for fusing multi-scale features as well as the attention map to generate the final high-quality high-resolution density maps. SFANet can be easily trained in an end-to-end way by dual path joint training. We have evaluated our method on four crowd counting datasets (ShanghaiTech, UCF_CC_50, UCSD and UCF-QNRF). The results demonstrate that with attention mechanism and multi-scale feature fusion, the proposed SFANet achieves the best performance on all these datasets and generates better quality density maps compared with other state-of-the-art approaches.
Research Motivation and Objectives
- Address large head scale variations and background noise in crowd counting.
- Leverage multi-scale feature fusion to generate high-resolution density maps.
- Incorporate an attention pathway to highlight head regions and suppress background.
- Propose a multi-task loss combining Euclidean density loss and attention guidance.
- Demonstrate superior performance on standard crowd counting benchmarks.
Proposed Method
- Use VGG16-bn backbone to extract multi-scale features (conv2-2, conv3-3, conv4-3, conv5-3).
- Construct a density map path (DMP) via feature pyramid fusion to produce high-resolution density maps.
- Construct an attention map path (AMP) with the same structure to learn head-region probabilities.
- Fuse DMP features with an attention map via element-wise multiplication to refine density features.
- Train with a multi-task loss: L = L_density + alpha * L_attention (alpha = 0.1).
- Generate ground-truth density maps by Gaussian-blurring head annotations; derive the attention ground truth from the density maps.
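The ground-truth construction, attention fusion, and loss listed above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the fixed `sigma`, the `threshold` used to binarize the density map, the per-image sum of squared errors for `L_density`, and the choice of binary cross-entropy for `L_attention` are all assumptions. Only the Gaussian blurring, the element-wise product, and `alpha = 0.1` come from the summary above.

```python
import math

def gaussian_density_map(points, height, width, sigma=1.5):
    """Sum one normalized 2D Gaussian per annotated head at (row, col)."""
    density = [[0.0] * width for _ in range(height)]
    for py, px in points:
        # Unnormalized kernel evaluated on every pixel of the image grid.
        kernel = [[math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2 * sigma ** 2))
                   for x in range(width)] for y in range(height)]
        total = sum(map(sum, kernel))
        # Normalize so each head adds exactly 1 to the sum of the map.
        for y in range(height):
            for x in range(width):
                density[y][x] += kernel[y][x] / total
    return density

def attention_mask(density, threshold=1e-3):
    """Binary head-region target: 1 where density exceeds the (assumed) threshold."""
    return [[1.0 if v > threshold else 0.0 for v in row] for row in density]

def fuse(features, attention):
    """Element-wise product of a feature map and an attention map."""
    return [[f * a for f, a in zip(f_row, a_row)]
            for f_row, a_row in zip(features, attention)]

def multitask_loss(pred_den, gt_den, pred_att, gt_att, alpha=0.1, eps=1e-7):
    """L = L_density + alpha * L_attention for a single image."""
    def flat(m):
        return [v for row in m for v in row]
    # Density term: sum of squared pixel differences (Euclidean loss).
    l_density = sum((p - g) ** 2 for p, g in zip(flat(pred_den), flat(gt_den)))
    # Attention term: mean binary cross-entropy (an assumed choice).
    bce = 0.0
    att_pairs = list(zip(flat(pred_att), flat(gt_att)))
    for p, g in att_pairs:
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        bce -= g * math.log(p) + (1.0 - g) * math.log(1.0 - p)
    l_attention = bce / len(att_pairs)
    return l_density + alpha * l_attention

gt = gaussian_density_map([(2, 2), (5, 6)], height=8, width=8)
mask = attention_mask(gt)
fused = fuse(gt, mask)
```

Because each kernel is normalized over the image grid, the ground-truth map sums to the number of annotated heads, which is what lets a predicted count be read off as the sum of a predicted density map.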
Experimental Results
Research Questions
- RQ1: Can a dual-path, multi-scale fusion network improve robustness to scale variation and background noise in crowd counting?
- RQ2: Does integrating an attention map path improve localization of head regions and density map quality?
- RQ3: Does the proposed multi-task loss accelerate convergence and boost counting accuracy?
Key Findings
| Dataset | Part | MAE | MSE |
|---|---|---|---|
| ShanghaiTech | Part A | 59.8 | 99.3 |
| ShanghaiTech | Part B | 6.9 | 10.9 |
| UCF_CC_50 | Full set | 219.6 | 316.2 |
| UCF-QNRF | Full set | 100.8 | 174.5 |
| UCSD | Full set | 0.82 | 1.07 |
- SFANet achieves state-of-the-art or competitive MAE/MSE across the ShanghaiTech, UCF_CC_50, UCF-QNRF, and UCSD datasets.
- On ShanghaiTech Part A, SFANet attains 59.8 MAE and 99.3 MSE; on Part B, 6.9 MAE and 10.9 MSE (per Table 1).
- On UCF_CC_50, SFANet achieves 219.6 MAE and 316.2 MSE.
- On UCF-QNRF, SFANet attains 100.8 MAE and 174.5 MSE.
- On UCSD, SFANet achieves 0.82 MAE and 1.07 MSE (lower is better).
- Ablation shows that the attention path improves performance beyond the VGG-DMP baselines, confirming its contribution.
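For readers who want to reproduce the two metrics reported above, the following is a small sketch of how they are typically computed from per-image counts. It assumes the common crowd-counting convention in which "MSE" is the square root of the mean squared count error, and that a predicted count is the sum of the predicted density map; the function names are illustrative and not taken from the paper.

```python
import math

def count_from_density(density):
    """Estimated crowd count: the sum of all density-map values."""
    return sum(sum(row) for row in density)

def mae_mse(pred_counts, gt_counts):
    """Return (MAE, MSE) over a test set, with MSE as root mean squared error."""
    n = len(gt_counts)
    errors = [p - g for p, g in zip(pred_counts, gt_counts)]
    mae = sum(abs(e) for e in errors) / n
    mse = math.sqrt(sum(e * e for e in errors) / n)
    return mae, mse

# Toy example with three images (made-up counts, not results from the paper).
mae, mse = mae_mse([10.0, 20.0, 33.0], [12.0, 20.0, 30.0])
```

Under this convention MSE is never smaller than MAE, which is consistent with every row of the table above.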
This review was generated by AI and checked by a human editor.