[Paper Explained] YOLOv1 to YOLOv11: A Comprehensive Survey of Real-Time Object Detection Innovations and Challenges
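A comprehensive survey of the YOLO family from v1 to v9, with discussion of the emerging v10/v11, architectural innovations, performance benchmarks, deployment, and open challenges.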
Over the past decade, object detection has advanced significantly, with the YOLO (You Only Look Once) family of models transforming the landscape of real-time vision applications through unified, end-to-end detection frameworks. From YOLOv1's pioneering regression-based detection to the latest YOLOv9, each version has systematically enhanced the balance between speed, accuracy, and deployment efficiency through continuous architectural and algorithmic advancements. Beyond core object detection, modern YOLO architectures have expanded to support tasks such as instance segmentation, pose estimation, object tracking, and domain-specific applications including medical imaging and industrial automation. This paper offers a comprehensive review of the YOLO family, highlighting architectural innovations, performance benchmarks, extended capabilities, and real-world use cases. We critically analyze the evolution of YOLO models and discuss emerging research directions that extend their impact across diverse computer vision domains.
Research Motivation and Objectives
- Survey the evolution of YOLO architectures from v1 to v9 (and brief notes on v10/v11).
- Analyze how backbone, neck, head, loss, and training strategies shaped speed-accuracy on standard benchmarks.
- Contextualize performance (mAP, FPS) on PASCAL VOC and COCO and discuss deployment on edge vs server.
- Identify open challenges such as training stability, domain shift robustness, and interpretability, and propose future directions.
Proposed Method
- Categorical taxonomy of YOLO innovations across five axes: Backbone, Neck, Detection Head, Loss and Assignment, Training Strategies.
- Chronological review of YOLO versions v1–v9 with key architectural and training developments.
- Performance benchmarking references including mAP and FPS on COCO and VOC datasets.
- Discussion of deployment characteristics across edge and server environments for real-time tasks.
- Compilation of open challenges and proposed future research directions.
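The mAP figures referenced above are built on intersection over union (IoU) between a predicted box and a ground-truth box: a prediction counts as a true positive only when its IoU clears a threshold (PASCAL VOC conventionally uses 0.5, while COCO averages over thresholds from 0.5 to 0.95). The following is a minimal sketch of that building block, not code from the survey. It assumes boxes in corner format `(x1, y1, x2, y2)`; the function names `iou` and `is_true_positive` are hypothetical and chosen only for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2).

    Assumption: corner format with x1 <= x2 and y1 <= y2.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """Hypothetical helper: does the prediction match at this IoU threshold?"""
    return iou(pred_box, gt_box) >= threshold


if __name__ == "__main__":
    pred = (0, 0, 2, 2)
    truth = (1, 1, 3, 3)
    print(iou(pred, truth), is_true_positive(pred, truth))
```

A full mAP computation additionally sorts predictions by confidence, builds a precision-recall curve per class, and averages the resulting AP values; that part is omitted here.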
Experimental Results
Research Questions
- RQ1: What architectural changes across YOLO versions most improved the speed-accuracy trade-off?
- RQ2: How have training strategies and loss/assignment functions evolved to improve convergence and robustness?
- RQ3: What are the practical deployment implications (edge vs server) for each YOLO generation?
- RQ4: What open challenges remain in YOLO (training stability, domain shift, interpretability) and where should future work focus?
- RQ5: How do newer YOLO versions (v8–v9, with mentions of v10/v11) extend capabilities to segmentation, pose estimation, and multi-task learning?
Key Findings
- YOLO evolution shows steady gains in mAP and FPS across generations, with notable breakthroughs in backbone and neck designs (e.g., Darknet-53, CSPDarknet, PANet, GELAN).
- Anchor-based to anchor-free transitions (v8) and decoupled heads (v6–v7) consistently improved localization and robustness, especially for small objects.
- Advanced training strategies (Mosaic, CutMix, EMA, SimOTA, DFL v2) and re-parameterization enabled better convergence and deployment efficiency.
- YOLOv4–v9 demonstrate state-of-the-art real-time detection performance on COCO, with significant edge deployment readiness and multi-task capabilities (segmentation, pose estimation).
- The survey highlights open challenges such as training stability in anchor-free variants, robustness under domain shift, and interpretability, outlining directions for future research.
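Among the training strategies listed above, EMA (exponential moving average of model weights) is the simplest to illustrate. The idea is to keep a smoothed copy of the weights, updated after each optimizer step, and to evaluate or deploy that copy instead of the raw weights. The sketch below shows only the update rule on plain Python floats; it is an illustration under stated assumptions, not the implementation used by any particular YOLO release. The name `ema_update`, the dict-of-floats weight representation, and the decay values are all hypothetical.

```python
def ema_update(ema_weights, model_weights, decay=0.999):
    """Return updated EMA weights: decay * ema + (1 - decay) * current.

    Assumption: weights are stored as {name: float}; real frameworks
    apply the same rule element-wise to tensors.
    """
    return {
        name: decay * ema_weights[name] + (1.0 - decay) * value
        for name, value in model_weights.items()
    }


# Toy usage: the raw weight jumps from 1.0 to 2.0, while the EMA copy
# moves toward it gradually.
ema = {"w": 1.0}
for step_weights in ({"w": 2.0}, {"w": 2.0}):
    ema = ema_update(ema, step_weights, decay=0.9)
print(ema)
```

A decay close to 1 gives a smoother but slower-moving average; practical training loops often ramp the decay up over the first iterations so the average is not dominated by the initial weights.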
This summary was generated by AI and reviewed by a human editor.