[Paper Review] YOLOv1 to YOLOv11: A Comprehensive Survey of Real-Time Object Detection Innovations and Challenges
A comprehensive survey of the YOLO family from v1 through v9, with discussions of emerging v10/v11, architectural innovations, performance benchmarks, deployment, and open challenges.
Over the past decade, object detection has advanced significantly, with the YOLO (You Only Look Once) family of models transforming the landscape of real-time vision applications through unified, end-to-end detection frameworks. From YOLOv1's pioneering regression-based detection to the latest YOLOv9, each version has systematically enhanced the balance between speed, accuracy, and deployment efficiency through continuous architectural and algorithmic advancements. Beyond core object detection, modern YOLO architectures have expanded to support tasks such as instance segmentation, pose estimation, object tracking, and domain-specific applications including medical imaging and industrial automation. This paper offers a comprehensive review of the YOLO family, highlighting architectural innovations, performance benchmarks, extended capabilities, and real-world use cases. We critically analyze the evolution of YOLO models and discuss emerging research directions that extend their impact across diverse computer vision domains.
Research Motivation and Objectives
- Survey the evolution of YOLO architectures from v1 to v9 (and brief notes on v10/v11).
- Analyze how backbone, neck, head, loss, and training strategies shaped speed-accuracy on standard benchmarks.
- Contextualize performance (mAP, FPS) on PASCAL VOC and COCO and discuss deployment on edge vs server.
- Identify open challenges such as training stability, domain shift robustness, and interpretability, and propose future directions.
Proposed Method
- Categorical taxonomy of YOLO innovations across five axes: Backbone, Neck, Detection Head, Loss and Assignment, Training Strategies.
- Chronological review of YOLO versions v1–v9 with key architectural and training developments.
- Performance benchmarking references including mAP and FPS on COCO and VOC datasets.
- Discussion of deployment characteristics across edge and server environments for real-time tasks.
- Compilation of open challenges and proposed future research directions.
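
The mAP@0.5 figures referenced in this review count a predicted box as a correct detection when its Intersection over Union (IoU) with a ground-truth box is at least 0.5. The following is a minimal sketch of that matching criterion, assuming boxes in `(x1, y1, x2, y2)` corner format. The function names `iou` and `is_true_positive` are hypothetical and are not taken from the survey or from any YOLO codebase.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction matches a ground-truth box when IoU >= threshold."""
    return iou(pred_box, gt_box) >= threshold
```

A full mAP computation additionally ranks predictions by confidence and averages precision over recall levels and classes; this sketch covers only the per-box matching step.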
Experimental Results
Research Questions
- RQ1: What architectural changes across YOLO versions most improved the speed-accuracy trade-off?
- RQ2: How have training strategies and loss/assignment functions evolved to improve convergence and robustness?
- RQ3: What are the practical deployment implications (edge vs server) for each YOLO generation?
- RQ4: What open challenges remain in YOLO (training stability, domain shift, interpretability) and where should future work focus?
- RQ5: How do newer YOLO versions (v8–v9, with mentions of v10/v11) extend capabilities to segmentation, pose estimation, and multi-task learning?
Key Findings
| YOLO Version | Backbone | Anchor Type | Feature Fusion | mAP@0.5 (COCO unless noted) | Speed (FPS) | Key Highlights |
|---|---|---|---|---|---|---|
| YOLOv1 | Custom CNN | None | None | 63.4% (VOC) | 45 | First unified detector |
| YOLOv2 | Darknet-19 | Anchor-based | None | 76.8% (VOC), 21.6% (COCO) | 67 | YOLO9000, k-means, multi-scale |
| YOLOv3 | Darknet-53 | Anchor-based | Multi-scale | 57.9% | 30–45 | Improved small-object detection |
| YOLOv4 | CSPDarknet-53 | Anchor-based | PANet + SPP | 43.5% (AP) | 62–65 | BoF/BoS, Mish, CutMix |
| YOLOv5 | CSPDarknet (PyTorch) | AutoAnchor | PANet | 50.1% | 60+ | Model scaling, exportability |
| YOLOv6 | EfficientRepNet | Hybrid | RepPAN | 52.5% | 70+ | Anchor-free option, decoupled head |
| YOLOv7 | E-ELAN | Anchor-based | PANet + E-ELAN | 56.8% | 60+ | RepConv, Coarse-to-fine head |
| YOLOv8 | C2f Modules | Anchor-free | FPN-style | 53.0% | 60–80 | Multi-task, modernized head |
| YOLOv9 | GELAN | Anchor-free | GELAN-FPN | 56.0%+ | 50–60 | SimOTA, DFLv2, scalable variants |
- YOLO evolution shows steady gains in mAP and FPS across generations, with notable breakthroughs in backbone and neck designs (e.g., Darknet-53, CSPDarknet, PANet, GELAN).
- The transition from anchor-based to anchor-free detection (v8) and the adoption of decoupled heads (v6–v7) consistently improved localization and robustness, especially for small objects.
- Advanced training strategies (Mosaic, CutMix, EMA, SimOTA, DFLv2) and re-parameterization enabled better convergence and deployment efficiency.
- YOLOv4–v9 demonstrate state-of-the-art real-time detection performance on COCO, with significant edge deployment readiness and multi-task capabilities (segmentation, pose estimation).
- The survey highlights open challenges such as training stability in anchor-free variants, robustness under domain shift, and interpretability, outlining directions for future research.
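
Of the training strategies listed above, EMA (exponential moving average of model weights) is the simplest to illustrate. The sketch below shows the general update rule, assuming weights are stored as a plain dictionary of floats. The function name `ema_update` and the `decay` default are illustrative assumptions, not the implementation used by any specific YOLO version.

```python
def ema_update(ema_weights, model_weights, decay=0.999):
    """Return updated EMA weights: ema = decay * ema + (1 - decay) * current.

    Both arguments map parameter names to float values (an assumption made
    for illustration; real frameworks operate on tensors).
    """
    return {
        name: decay * ema_weights[name] + (1.0 - decay) * model_weights[name]
        for name in ema_weights
    }


# Usage: the averaged weight moves smoothly toward the current model weight.
ema = {"w": 0.0}
for _ in range(3):
    ema = ema_update(ema, {"w": 1.0}, decay=0.5)
```

The averaged copy is typically the one evaluated and exported, because it is less sensitive to noise in individual optimization steps than the raw training weights.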
This review was generated by AI and checked by a human editor.