[Paper Review] SynMVCrowd: A Large Synthetic Benchmark for Multi-view Crowd Counting and Localization
SynMVCrowd introduces a large synthetic benchmark for multi-view crowd counting and localization, with 50 scenes, 50 camera views, 200 frames per scene, and 200–1000 people per scene, plus a strong multi-view baseline that outperforms existing methods.
Existing multi-view crowd counting and localization methods are evaluated on relatively small scenes with limited crowd numbers, camera views, and frames. This makes evaluating and comparing them impractical, because such small datasets are easy to overfit. To address this, 3DROM proposes a data augmentation method. In this paper, we instead propose a large synthetic benchmark, SynMVCrowd, for more practical evaluation and comparison of multi-view crowd counting and localization methods. The SynMVCrowd benchmark consists of 50 synthetic scenes with a large number of multi-view frames and camera views and a much larger crowd number (up to 1000), which makes it better suited to large-scene multi-view crowd vision tasks. We also propose strong multi-view crowd localization and counting baselines that outperform all comparison methods on the new SynMVCrowd benchmark. Moreover, we show that the benchmark improves domain transfer of multi-view and single-image counting to novel real scenes. As a result, the proposed benchmark could move research on multi-view and single-image crowd counting and localization toward more practical applications. The code and datasets are available at https://github.com/zqyq/SynMVCrowd.
Research Motivation and Objectives
- Motivate evaluation and comparison of multi-view crowd counting/localization under large-scale, cross-scene settings.
- Provide a large synthetic benchmark to reduce overfitting and improve generalization for real-world deployment.
- Establish strong baselines for multi-view counting and localization that outperform existing methods on the new benchmark.
- Explore cross-domain benefits for domain transfer to novel real scenes using SynMVCrowd.
Proposed Method
- Extend the GTA-V-based GCC synthetic pipeline to generate 50 scenes, 50 camera views, and 200 frames per scene.
- Create detailed scene setup including ROI-based crowd placement, weather and time variation, and a camera arrangement that covers each scene.
- Define a character setup with diverse avatars, random but controlled actions, and unique IDs for precise tracking across views.
- Synthesize scenes by incrementally populating subareas to get past GTA-V's limit of 256 people per scene, then merge the passes to produce multi-view frames with ground-truth annotations.
- Propose a strong multi-view baseline with modules for single-view feature extraction, spatial feature selection, multi-view feature projection and fusion, and multi-view decoding, trained with either MSE or Optimal Transport loss.
- Evaluate baselines against state-of-the-art multi-view methods and analyze cross-scene generalization and single-image applicability.
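The subarea step in the list above can be illustrated with a minimal sketch. It assumes each pass populates one subarea with at most 256 people and that merging amounts to concatenating the per-pass annotations under globally unique IDs. The `Person` record, the function names, and the coordinate layout are hypothetical and are not taken from the SynMVCrowd code.

```python
from dataclasses import dataclass

GTA_V_MAX_PEOPLE = 256  # per-scene character limit cited in the review


@dataclass(frozen=True)
class Person:
    person_id: int  # unique across the merged frame
    x: float        # ground-plane position (hypothetical units)
    y: float


def populate_subarea(positions, next_id):
    """Assign fresh IDs to one subarea pass and enforce the per-pass limit."""
    if len(positions) > GTA_V_MAX_PEOPLE:
        raise ValueError("a single pass cannot exceed the per-scene limit")
    people = [Person(next_id + i, x, y) for i, (x, y) in enumerate(positions)]
    return people, next_id + len(people)


def merge_subareas(subarea_positions):
    """Concatenate several passes into one annotation list with unique IDs."""
    merged, next_id = [], 0
    for positions in subarea_positions:
        people, next_id = populate_subarea(positions, next_id)
        merged.extend(people)
    return merged


# Four hypothetical subareas of 250 people each give 1000 people in total.
subareas = [
    [(float(area * 100 + i % 50), float(i // 50)) for i in range(250)]
    for area in range(4)
]
frame = merge_subareas(subareas)
```

Keeping a single running ID counter across passes is what lets the same person be matched across camera views after the merge.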
Experimental Results
Research Questions
- RQ1: Can a large synthetic benchmark with diverse scenes, camera views, and crowd densities better evaluate and compare multi-view crowd counting/localization methods under cross-scene settings?
- RQ2: Do strong multi-view baselines trained on SynMVCrowd surpass existing methods on this benchmark and generalize to novel scenes?
- RQ3: What is the impact of using optimal transport loss versus MSE loss for multi-view crowd localization on SynMVCrowd?
- RQ4: Can SynMVCrowd promote improvements in single-image crowd counting/localization alongside multi-view tasks?
- RQ5: How does SynMVCrowd help assess domain transfer to real-world scenes?
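To make the contrast behind the optimal transport versus MSE question concrete, here is a toy 1D illustration. It is not the loss used in the paper: it assumes unit-mass histograms on a line with unit spacing between cells, where the Wasserstein-1 transport cost equals the sum of absolute differences between the two cumulative sums. All function names are hypothetical.

```python
from itertools import accumulate


def mse(pred, target):
    """Mean squared error between two equal-length 1D maps."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def ot_cost_1d(pred, target):
    """Wasserstein-1 cost between two 1D histograms of equal total mass.

    On a line with unit spacing between cells, this equals the sum of
    absolute differences between the two cumulative sums.
    """
    if abs(sum(pred) - sum(target)) > 1e-9:
        raise ValueError("this toy version assumes equal total mass")
    return sum(abs(cp - ct) for cp, ct in zip(accumulate(pred), accumulate(target)))


def one_hot(index, length=10):
    """A single unit of mass at the given cell."""
    return [1.0 if i == index else 0.0 for i in range(length)]


target = one_hot(2)     # ground-truth person at cell 2
near_miss = one_hot(3)  # prediction off by one cell
far_miss = one_hot(8)   # prediction off by six cells

print(mse(near_miss, target), mse(far_miss, target))
print(ot_cost_1d(near_miss, target), ot_cost_1d(far_miss, target))
```

In this toy case MSE assigns the same penalty to both misses because the peaks do not overlap, while the transport cost grows with the localization error. That is one intuition for why a transport-based loss can suit point localization.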
Key Findings
| Method | MODA | MODP | Precision | Recall | F1_score |
|---|---|---|---|---|---|
| MVDet | 27.0 | 52.2 | 72.2 | 43.9 | 54.6 |
| SHOT | 32.5 | 52.6 | 74.5 | 49.3 | 59.3 |
| MVDeTr | 35.6 | 69.7 | 95.4 | 37.4 | 53.7 |
| 3DROM | 24.2 | 59.2 | 86.1 | 28.8 | 43.2 |
| SVCW | 35.8 | 55.6 | 75.8 | 51.7 | 61.4 |
| MVOT | 45.5 | 66.3 | 83.4 | 56.9 | 67.6 |
| TrackTacular | 45.8 | 71.1 | 92.6 | 49.8 | 64.8 |
| Baseline (MSE) | 34.6 | 74.5 | 92.9 | 37.4 | 53.4 |
| Baseline (OT) | 49.6 | 70.2 | 88.6 | 57.0 | 69.4 |
- SynMVCrowd is presented as the largest synthetic benchmark for multi-view and single-image crowd counting/localization, with 50 scenes, 50 camera views, 200 frames per scene, and 200–1000 people per scene.
- The proposed Baseline (OT) achieves the best MODA (49.6), Recall (57.0), and F1_score (69.4) among the listed multi-view localization methods on SynMVCrowd. It does not lead on MODP, where Baseline (MSE) scores 74.5, or on Precision, where MVDeTr scores 95.4.
- Existing methods such as SHOT, MVDeTr, and MVOT show strengths on individual metrics (MVDeTr, for example, has the highest Precision at 95.4), which the review associates with design choices such as multi-height fusion, deformable fusion, and point supervision. Overall, Baseline (OT) achieves the best balance of localization metrics on SynMVCrowd.
- Models trained with the aid of SynMVCrowd are reported to transfer better to novel real scenes, which points to practical value for real-world deployment and cross-domain research.
- The dataset supports both multi-view and single-image tasks, enabling evaluation of cross-domain performance and transferability.
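The columns of the results table above are related, and a short sketch can check how closely. It assumes the usual definitions: F1 as the harmonic mean of precision and recall, and MODA = 1 - (FP + FN) / GT, which can be rewritten in terms of precision P and recall R as R * (2 - 1 / P). Whether the paper aggregates its metrics in exactly this way is an assumption, and the function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)


def moda_from_pr(precision, recall):
    """MODA in percent, assuming MODA = 1 - (FP + FN) / GT.

    With TP = R * GT, FN = (1 - R) * GT and FP = TP * (1 - P) / P,
    this simplifies to MODA = R * (2 - 1 / P) for P, R in [0, 1].
    """
    p, r = precision / 100.0, recall / 100.0
    return 100.0 * r * (2.0 - 1.0 / p)


# (MODA, Precision, Recall, F1_score) copied from the results table.
rows = {
    "MVDet": (27.0, 72.2, 43.9, 54.6),
    "SHOT": (32.5, 74.5, 49.3, 59.3),
    "MVDeTr": (35.6, 95.4, 37.4, 53.7),
    "3DROM": (24.2, 86.1, 28.8, 43.2),
    "SVCW": (35.8, 75.8, 51.7, 61.4),
    "MVOT": (45.5, 83.4, 56.9, 67.6),
    "TrackTacular": (45.8, 92.6, 49.8, 64.8),
    "Baseline (MSE)": (34.6, 92.9, 37.4, 53.4),
    "Baseline (OT)": (49.6, 88.6, 57.0, 69.4),
}

for name, (moda, p, r, f1) in rows.items():
    print(f"{name:15s} F1 {f1_score(p, r):5.1f} vs {f1:5.1f}   "
          f"MODA {moda_from_pr(p, r):5.1f} vs {moda:5.1f}")

max_f1_gap = max(abs(f1_score(p, r) - f1) for _, p, r, f1 in rows.values())
max_moda_gap = max(abs(moda_from_pr(p, r) - moda) for moda, p, r, _ in rows.values())
```

Under these assumptions the recomputed F1 stays within about 0.1 of every reported value. The recomputed MODA does too for every row except SVCW, where it comes out near 35.2 against the reported 35.8. That gap may reflect rounding or a different aggregation in the paper rather than an error.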
This review was generated by AI and checked by a human editor.