[论文解读] RF-DETR Object Detection vs YOLOv12 : A Study of Transformer-based and CNN-based Architectures for Single-Class and Multi-Class Greenfruit Detection in Complex Orchard Environments Under Label Ambiguity
本研究直接比较 RF-DETR(基于 transformer)与 YOLOv12(基于 CNN)在复杂果园中的绿果检测,评估单类与多类(有遮挡与无遮挡)情景下的标签歧义。RF-DETR 在精度方面占优,而 YOLOv12 在边缘部署时更具高效性。
This study conducts a detailed comparison of RF-DETR object detection base model and YOLOv12 object detection model configurations for detecting greenfruits in a complex orchard environment marked by label ambiguity, occlusions, and background blending. A custom dataset was developed featuring both single-class (greenfruit) and multi-class (occluded and non-occluded greenfruits) annotations to assess model performance under dynamic real-world conditions. RF-DETR object detection model, utilizing a DINOv2 backbone and deformable attention, excelled in global context modeling, effectively identifying partially occluded or ambiguous greenfruits. In contrast, YOLOv12 leveraged CNN-based attention for enhanced local feature extraction, optimizing it for computational efficiency and edge deployment. RF-DETR achieved the highest mean Average Precision (mAP50) of 0.9464 in single-class detection, proving its superior ability to localize greenfruits in cluttered scenes. Although YOLOv12N recorded the highest mAP@50:95 of 0.7620, RF-DETR consistently outperformed in complex spatial scenarios. For multi-class detection, RF-DETR led with an mAP@50 of 0.8298, showing its capability to differentiate between occluded and non-occluded fruits, while YOLOv12L scored highest in mAP@50:95 with 0.6622, indicating better classification in detailed occlusion contexts. Training dynamics analysis highlighted RF-DETR's swift convergence, particularly in single-class settings where it plateaued within 10 epochs, demonstrating the efficiency of transformer-based architectures in adapting to dynamic visual data. These findings validate RF-DETR's effectiveness for precision agricultural applications, with YOLOv12 suited for fast-response scenarios. >Index Terms: RF-DETR object detection, YOLOv12, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO World, YOLO, You Only Look Once, Roboflow, Detection Transformers, CNNs
研究动机与目标
- 评估在自定义绿果数据集上,RF-DETR 和 YOLOv12 的检测精度,涵盖单类与多类标签。
- 在真实果园条件下评估模型在遮挡、伪装与背景杂乱中的性能。
- 分析收敛行为与推理效率,为精准农业的部署决策提供指引。
提出的方法
- 对 RF-DETR 与 YOLOv12 使用相同的数据集、训练协议与训练轮次。
- RF-DETR 采用 DINOv2 骨干与可变形注意力;不使用锚框或 NMS;特征为单尺度。
- YOLOv12 采用 R-ELAN 骨干与区域注意力;多任务头用于检测、定向边界框和实例分割。
- 输入分辨率统一为 640x640;在 FP32 下训练,批量大小约为 16,使用 RTX A5000。
- 以精度、召回率、F1、mAP@50 与 mAP@50:95,以及 mIoU 进行评估;并评估推理速度。

实验结果
研究问题
- RQ1在标签歧义条件下,RF-DETR 与 YOLOv12 在单类绿果检测中的表现有何差异?
- RQ2两种模型在多类检测中区分遮挡与非遮挡果实的表现如何?
- RQ3在农业场景中,变换型检测器与卷积神经网络检测器的收敛动力学与训练效率为何?
- RQ4RF-DETR 与 YOLOv12 的相对推理速度与边缘部署适用性如何?
主要发现
- RF-DETR 在单类检测中达到 mAP@50=0.9464。
- YOLOv12N 在单类情景下达到最高 mAP@50:95=0.7620。
- 在多类检测中,RF-DETR 达到 mAP@50=0.8298。
- YOLOv12L 在多类条件下引领 mAP@50:95,为 0.6622。
- RF-DETR 显示快速收敛,在 10-20 个epochs 内便进入平台期,表现出高效的训练动力学。

更好的研究,从现在开始
从论文设计到论文写作,大幅缩短您的研究时间。
无需绑定信用卡
本解读由 AI 生成,并经人工编辑审核。