QUICK REVIEW

[Paper Review] RF-DETR Object Detection vs YOLOv12: A Study of Transformer-based and CNN-based Architectures for Single-Class and Multi-Class Greenfruit Detection in Complex Orchard Environments Under Label Ambiguity

Ranjan Sapkota, Rahul Harsha Cheppally|ArXiv.org|Apr 17, 2025
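Smart Agriculture and AI | Cited by 3
ONE-SENTENCE SUMMARY

This study directly compares RF-DETR (transformer-based) with YOLOv12 (CNN-based) for greenfruit detection in complex orchards, evaluating single-class and multi-class (occluded vs. non-occluded) scenarios under label ambiguity. RF-DETR leads in detection accuracy at mAP@50, while YOLOv12 is the more efficient choice for edge deployment.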

ABSTRACT

This study conducts a detailed comparison of RF-DETR object detection base model and YOLOv12 object detection model configurations for detecting greenfruits in a complex orchard environment marked by label ambiguity, occlusions, and background blending. A custom dataset was developed featuring both single-class (greenfruit) and multi-class (occluded and non-occluded greenfruits) annotations to assess model performance under dynamic real-world conditions. RF-DETR object detection model, utilizing a DINOv2 backbone and deformable attention, excelled in global context modeling, effectively identifying partially occluded or ambiguous greenfruits. In contrast, YOLOv12 leveraged CNN-based attention for enhanced local feature extraction, optimizing it for computational efficiency and edge deployment. RF-DETR achieved the highest mean Average Precision (mAP50) of 0.9464 in single-class detection, proving its superior ability to localize greenfruits in cluttered scenes. Although YOLOv12N recorded the highest mAP@50:95 of 0.7620, RF-DETR consistently outperformed in complex spatial scenarios. For multi-class detection, RF-DETR led with an mAP@50 of 0.8298, showing its capability to differentiate between occluded and non-occluded fruits, while YOLOv12L scored highest in mAP@50:95 with 0.6622, indicating better classification in detailed occlusion contexts. Training dynamics analysis highlighted RF-DETR's swift convergence, particularly in single-class settings where it plateaued within 10 epochs, demonstrating the efficiency of transformer-based architectures in adapting to dynamic visual data. These findings validate RF-DETR's effectiveness for precision agricultural applications, with YOLOv12 suited for fast-response scenarios.

Index Terms: RF-DETR object detection, YOLOv12, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO World, YOLO, You Only Look Once, Roboflow, Detection Transformers, CNNs

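MOTIVATION AND OBJECTIVES

  • Evaluate the detection accuracy of RF-DETR and YOLOv12 on a custom greenfruit dataset with both single-class and multi-class labels.
  • Assess model performance under occlusion, camouflage, and background clutter in real orchard conditions.
  • Analyze convergence behavior and inference efficiency to guide deployment decisions in precision agriculture.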

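PROPOSED METHOD

  • RF-DETR and YOLOv12 are trained on the same dataset, with the same training protocol and the same number of epochs.
  • RF-DETR uses a DINOv2 backbone with deformable attention; it requires neither anchor boxes nor NMS, and it operates on single-scale features.
  • YOLOv12 uses an R-ELAN backbone with area attention; its multi-task heads support detection, oriented bounding boxes, and instance segmentation.
  • Input resolution is fixed at 640x640 for both models; training runs in FP32 with a batch size of about 16 on an RTX A5000.
  • Evaluation uses precision, recall, F1, mAP@50, mAP@50:95, and mIoU; inference speed is also measured.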
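
The paper's evaluation code is not reproduced in this review, so the following is only a minimal sketch of how IoU-based precision, recall, and F1 are commonly computed for one class on one image. The `(x1, y1, x2, y2)` box format, the greedy highest-score-first matching, and the names `iou` and `precision_recall_f1` are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(
    preds: List[Tuple[Box, float]],  # (box, confidence score)
    gts: List[Box],
    iou_thr: float = 0.5,
) -> Tuple[float, float, float]:
    """Greedy matching: each ground-truth box can be claimed at most once."""
    matched = set()
    tp = 0
    # Visit predictions from most to least confident.
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= iou_thr:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp  # predictions with no matching fruit
    fn = len(gts) - tp    # fruits that were missed
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1
```

As a usage example, take two ground-truth fruits and three predictions, one of which is offset by a pixel and one of which hits only background. At `iou_thr=0.5` two predictions match (precision 2/3, recall 1); raising the threshold to 0.75 rejects the offset box, so precision falls to 1/3 and recall to 1/2.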
Figure 1: Classification of object detection methodologies: Top features state-of-the-art CNN-based and Transformer-based methods, widely adopted; Vision Language Models are emerging. Also includes Hybrid, Sparse Coding, and Traditional Feature-based approaches.

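EXPERIMENTAL RESULTS

Research Questions

  • RQ1: Under label ambiguity, how do RF-DETR and YOLOv12 differ in single-class greenfruit detection?
  • RQ2: How well does each model distinguish occluded from non-occluded fruits in multi-class detection?
  • RQ3: How do the convergence dynamics and training efficiency of transformer-based and CNN-based detectors compare in agricultural scenes?
  • RQ4: How do RF-DETR and YOLOv12 compare in inference speed and suitability for edge deployment?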

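Key Findings

  • RF-DETR reaches mAP@50 = 0.9464 in single-class detection.
  • YOLOv12N achieves the highest mAP@50:95 in the single-class setting, at 0.7620.
  • In multi-class detection, RF-DETR reaches mAP@50 = 0.8298.
  • YOLOv12L leads mAP@50:95 in the multi-class setting, at 0.6622.
  • RF-DETR converges quickly, plateauing within 10 to 20 epochs, which indicates efficient training dynamics.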
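
The two mAP variants reported above differ only in how strict the box match is: mAP@50 counts a detection as correct at IoU >= 0.50, while mAP@50:95 follows the COCO convention of averaging AP over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below illustrates that averaging for one class on one image. It assumes `(x1, y1, x2, y2)` boxes, greedy matching, and all-point interpolation of the precision-recall curve; the function names are hypothetical and this is not the evaluation code used in the paper.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(
    preds: List[Tuple[Box, float]], gts: List[Box], iou_thr: float
) -> float:
    """AP at a single IoU threshold (area under the precision-recall curve)."""
    if not gts:
        return 0.0
    matched = set()
    hits = []  # True for each prediction that claims a ground-truth box
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if j not in matched:
                overlap = iou(box, gt)
                if overlap > best_iou:
                    best_iou, best_j = overlap, j
        is_tp = best_j >= 0 and best_iou >= iou_thr
        if is_tp:
            matched.add(best_j)
        hits.append(is_tp)

    # Precision and recall after each prediction, in score order.
    precisions, recalls, tp = [], [], 0
    for k, hit in enumerate(hits, start=1):
        tp += 1 if hit else 0
        precisions.append(tp / k)
        recalls.append(tp / len(gts))
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas under the envelope.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def map_50_95(preds: List[Tuple[Box, float]], gts: List[Box]) -> float:
    """Mean AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [(50 + 5 * i) / 100 for i in range(10)]
    return sum(average_precision(preds, gts, t) for t in thresholds) / len(thresholds)
```

For a single ground-truth box and a single prediction that overlaps it with IoU 0.825, `average_precision(preds, gts, 0.50)` is 1.0 but `map_50_95(preds, gts)` is 0.7, because the match fails at thresholds 0.85, 0.90, and 0.95. This is one way a detector can rank first on mAP@50 without ranking first on mAP@50:95: the stricter metric rewards tighter boxes, not just finding the fruit.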
Figure 2: CNN vs Transformer-based model performance comparison focusing on YOLOv12 (CNN-based) and RF-DETR (Transformer-based) architectures: (a) RF-DETR object detection model benchmark evaluation with YOLO11, YOLOv8 and other DETR-based object detection models; (b) RF-DETR evaluation on the RF100

This review was generated by AI and checked by human editors.