QUICK REVIEW

[Paper Review] YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications

Chuyi Li, Lulu Li|arXiv (Cornell University)|Sep 7, 2022

Advanced Neural Network Applications | Cited by 1,733

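One-Sentence Summary

YOLOv6 is a family of single-stage detectors aimed at industrial applications. It combines a backbone and neck built from re-parameterizable blocks, a decoupled head, TAL label assignment, updated loss functions, self-distillation, and quantization strategies to reach a state-of-the-art speed-accuracy trade-off across several model scales.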

ABSTRACT

For years, the YOLO series has been the de facto industry-level standard for efficient object detection. The YOLO community has prospered overwhelmingly to enrich its use in a multitude of hardware platforms and abundant scenarios. In this technical report, we strive to push its limits to the next level, stepping forward with an unwavering mindset for industry application. Considering the diverse requirements for speed and accuracy in the real environment, we extensively examine the up-to-date object detection advancements either from industry or academia. Specifically, we heavily assimilate ideas from recent network design, training strategies, testing techniques, quantization, and optimization methods. On top of this, we integrate our thoughts and practice to build a suite of deployment-ready networks at various scales to accommodate diversified use cases. With the generous permission of YOLO authors, we name it YOLOv6. We also express our warm welcome to users and contributors for further enhancement. For a glimpse of performance, our YOLOv6-N hits 35.9% AP on the COCO dataset at a throughput of 1234 FPS on an NVIDIA Tesla T4 GPU. YOLOv6-S strikes 43.5% AP at 495 FPS, outperforming other mainstream detectors at the same scale (YOLOv5-S, YOLOX-S, and PPYOLOE-S). Our quantized version of YOLOv6-S even brings a new state-of-the-art 43.3% AP at 869 FPS. Furthermore, YOLOv6-M/L also achieves better accuracy performance (i.e., 49.5%/52.3%) than other detectors with a similar inference speed. We carefully conducted experiments to validate the effectiveness of each component. Our code is made available at https://github.com/meituan/YOLOv6.

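Research Motivation and Objectives

Design an industry-friendly continuation of the YOLO family that targets the speed-accuracy balance required in real serving environments.
Develop a series of scalable networks (N, S, M, L) that use re-parameterizable blocks and an efficient neck and head to maximize throughput on common hardware.
Integrate advanced training strategies (self-distillation, TAL label assignment, specialized losses) with deployment-oriented quantization (RepOptimizer, QAT with channel-wise distillation) to improve real-world performance.
Evaluate YOLOv6 against current state-of-the-art detectors on the COCO dataset, showing faster inference and competitive accuracy across model sizes.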

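Proposed Method

Introduce the EfficientRep backbone for small models and the CSPStackRep Block for larger models to balance speed and accuracy.
Adopt a Rep-PAN neck and an efficient decoupled head, with a hybrid-channel strategy to reduce computation.
Use TAL (Task Alignment Learning) as the default label assignment in place of SimOTA to improve stability and performance.
Based on ablation studies, select VariFocal Loss for classification and SIoU/GIoU variants for regression; selectively add DFL/DFLv2 on larger models.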
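Introduce industry-friendly tricks: longer training, self-distillation (the teacher is the model itself), gray-border handling, and adjusted training-epoch scheduling.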
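Apply RepOptimizer-based training to obtain PTQ-friendly weights; use QAT with channel-wise distillation together with graph optimization for quantization-aware deployment.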
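The re-parameterizable blocks listed above rest on one idea: a multi-branch block used during training can be folded into a single convolution for inference. The following is a minimal sketch of that idea, not the YOLOv6 implementation. It assumes a single channel, no bias, and no batch normalization (real RepVGG-style blocks also fold the normalization layers), and the helper names `conv2d_same`, `fuse_branches`, and `multi_branch` are hypothetical.

```python
def conv2d_same(image, kernel):
    """Single-channel 3x3 cross-correlation, stride 1, zero padding."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(-1, 2):
                for dj in range(-1, 2):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di + 1][dj + 1] * image[y][x]
            out[i][j] = acc
    return out


def multi_branch(image, k3, k1):
    """Training-time form: 3x3 branch + 1x1 branch (scalar k1) + identity."""
    y3 = conv2d_same(image, k3)
    h, w = len(image), len(image[0])
    return [[y3[i][j] + k1 * image[i][j] + image[i][j] for j in range(w)]
            for i in range(h)]


def fuse_branches(k3, k1):
    """Fold all three branches into one 3x3 kernel.

    The 1x1 weight and the identity mapping both act only on the center
    pixel, so they are added to the center tap of the 3x3 kernel.
    """
    fused = [row[:] for row in k3]
    fused[1][1] += k1 + 1.0
    return fused


# Illustrative values only (assumptions, not from the paper).
image = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0], [2.0, 1.0, 1.0]]
k3 = [[0.1, -0.2, 0.0], [0.3, 0.5, -0.1], [0.0, 0.2, 0.4]]
k1 = 0.7

y_train = multi_branch(image, k3, k1)
y_deploy = conv2d_same(image, fuse_branches(k3, k1))
max_diff = max(abs(a - b)
               for row_a, row_b in zip(y_train, y_deploy)
               for a, b in zip(row_a, row_b))
print("largest difference between the two forms:", max_diff)
```

The two forms agree up to floating-point rounding, which is why the fused single-path network can be deployed without changing the trained model's outputs.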

Experimental Results

Research Questions

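RQ1: Given industrial speed-accuracy requirements, which backbone and neck design (single-path vs. multi-branch) is optimal at each model scale (N, S, M, L)?
RQ2: How do label assignment strategies (ATSS, SimOTA, TAL, and others) affect training stability and final mAP?
RQ3: Which classification and localization losses maximize accuracy at each model scale while preserving inference speed?
RQ4: Which deployment-oriented quantization strategies (RepOptimizer, QAT with channel-wise distillation) maximize speedup while minimizing accuracy loss?
RQ5: On standard hardware, how do the YOLOv6 variants compare with YOLOv5, YOLOX, PPYOLOE, and YOLOv7 on COCO in overall performance (AP and FPS)?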

Key Findings

Model	Input Size	AP	AP50	FPS (bs=1)	FPS (bs=32)	Latency	Params	FLOPs
YOLOv6-N	640	35.9%	51.2%	802	1234	1.2 ms	4.3 M	11.1 G
YOLOv6-S	640	43.5%	60.4%	358	495	2.8 ms	17.2 M	44.2 G
YOLOv6-M	640	49.5%	66.8%	179	233	5.6 ms	34.3 M	82.2 G
YOLOv6-L	640	52.5%	70.0%	98	121	10.2 ms	58.5 M	144.0 G
YOLOv6-L-ReLU	640	51.7%	69.2%	113	149	8.8 ms	58.5 M	144.0 G
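As a quick sanity check on the table above, the latency column can be compared with the reciprocal of the batch-size-1 throughput. The sketch below assumes that the reported latency corresponds to 1000 / FPS (bs=1); the table does not state this, so treat it as a hypothesis being tested. The variable names are illustrative.

```python
# (model, FPS at bs=1, reported latency in ms), copied from the table.
rows = [
    ("YOLOv6-N", 802, 1.2),
    ("YOLOv6-S", 358, 2.8),
    ("YOLOv6-M", 179, 5.6),
    ("YOLOv6-L", 98, 10.2),
    ("YOLOv6-L-ReLU", 113, 8.8),
]


def latency_ms_from_fps(fps):
    """Per-image latency in milliseconds implied by a throughput in FPS."""
    return 1000.0 / fps


for name, fps, reported in rows:
    derived = latency_ms_from_fps(fps)
    print(f"{name}: derived {derived:.2f} ms, reported {reported} ms")

# Largest gap between the derived and reported latency across all rows.
max_gap = max(abs(latency_ms_from_fps(fps) - reported)
              for _, fps, reported in rows)
print(f"largest gap: {max_gap:.3f} ms")
```

Under this assumption every derived value lands within a few hundredths of a millisecond of the reported one, so the latency column is consistent with the bs=1 throughput figures.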

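YOLOv6-N reaches 35.9% AP on a Tesla T4 at 1234 FPS (bs=32); at bs=1 it runs at 802 FPS with 1.2 ms latency, demonstrating real-time performance.
YOLOv6-S reaches 43.5% AP at 495 FPS (bs=32) on a T4, outperforming YOLOv5-S and YOLOX-S at the same scale; the quantized YOLOv6-S reaches 43.3% AP at 869 FPS.
YOLOv6-M reaches 49.5% AP at 233 FPS (bs=32) with 5.6 ms latency, outperforming detectors of similar speed; YOLOv6-L reaches 52.5% AP at 121 FPS (bs=32) with 10.2 ms latency and 144.0 G FLOPs.
The YOLOv6-L-ReLU variant offers a competitive accuracy/speed trade-off: the L model with ReLU reaches 51.7% AP at 149 FPS (bs=32).
In ablations, TAL consistently outperforms SimOTA and ATSS for label assignment; VFL gives a slight classification improvement over Focal Loss; SIoU/CIoU achieve the best results as regression losses across model variants.
The quantization methods, including RepOptimizer-based PTQ and QAT with channel-wise distillation, deliver deployment-friendly accuracy with significant speedup on hardware such as the Tesla T4.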
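To make the label-assignment finding more concrete, here is a minimal sketch of the alignment metric behind task alignment learning, t = s^alpha * u^beta, where s is the predicted classification score and u is the IoU between the predicted box and the ground-truth box. The exponents alpha=1.0 and beta=6.0, the top-k value, and the candidate scores are illustrative assumptions and are not taken from this summary. The function names are hypothetical, and a full assigner would also restrict candidates to anchors inside the ground-truth box and resolve anchors claimed by several objects.

```python
def alignment_metric(cls_score, iou, alpha=1.0, beta=6.0):
    """Task-alignment score t = s^alpha * u^beta for one candidate anchor."""
    return (cls_score ** alpha) * (iou ** beta)


def select_topk(candidates, k):
    """Return indices of the k candidates with the highest alignment score.

    `candidates` is a list of (cls_score, iou) pairs for one ground-truth box.
    """
    scored = [(alignment_metric(s, u), idx)
              for idx, (s, u) in enumerate(candidates)]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Hypothetical (classification score, IoU) pairs for four anchors.
candidates = [(0.9, 0.3), (0.6, 0.8), (0.4, 0.9), (0.8, 0.5)]
positives = select_topk(candidates, k=2)
print("anchors chosen as positives:", positives)
```

With a large beta, an anchor with a confident class score but poor localization (the first candidate) ranks below anchors that are well localized, which is the sense in which the assignment aligns the classification and regression tasks.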

This review was generated by AI and reviewed by a human editor.