QUICK REVIEW

[Paper Review] RF-DETR Object Detection vs YOLOv12 : A Study of Transformer-based and CNN-based Architectures for Single-Class and Multi-Class Greenfruit Detection in Complex Orchard Environments Under Label Ambiguity

Ranjan Sapkota, Rahul Harsha Cheppally | arXiv.org | Apr 17, 2025

Smart Agriculture and AI | Cited by 3

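One-Sentence Summary

This study directly compares RF-DETR (transformer-based) and YOLOv12 (CNN-based) for greenfruit detection in complex orchards, evaluating both under label ambiguity in single-class and multi-class (occluded vs. non-occluded) settings. RF-DETR leads in detection accuracy (mAP@50), while YOLOv12 is the more efficient choice for edge deployment.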

ABSTRACT

This study conducts a detailed comparison of RF-DETR object detection base model and YOLOv12 object detection model configurations for detecting greenfruits in a complex orchard environment marked by label ambiguity, occlusions, and background blending. A custom dataset was developed featuring both single-class (greenfruit) and multi-class (occluded and non-occluded greenfruits) annotations to assess model performance under dynamic real-world conditions. RF-DETR object detection model, utilizing a DINOv2 backbone and deformable attention, excelled in global context modeling, effectively identifying partially occluded or ambiguous greenfruits. In contrast, YOLOv12 leveraged CNN-based attention for enhanced local feature extraction, optimizing it for computational efficiency and edge deployment. RF-DETR achieved the highest mean Average Precision (mAP50) of 0.9464 in single-class detection, proving its superior ability to localize greenfruits in cluttered scenes. Although YOLOv12N recorded the highest mAP@50:95 of 0.7620, RF-DETR consistently outperformed in complex spatial scenarios. For multi-class detection, RF-DETR led with an mAP@50 of 0.8298, showing its capability to differentiate between occluded and non-occluded fruits, while YOLOv12L scored highest in mAP@50:95 with 0.6622, indicating better classification in detailed occlusion contexts. Training dynamics analysis highlighted RF-DETR's swift convergence, particularly in single-class settings where it plateaued within 10 epochs, demonstrating the efficiency of transformer-based architectures in adapting to dynamic visual data. These findings validate RF-DETR's effectiveness for precision agricultural applications, with YOLOv12 suited for fast-response scenarios.

Index Terms: RF-DETR object detection, YOLOv12, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO World, YOLO, You Only Look Once, Roboflow, Detection Transformers, CNNs

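Motivation and Objectives

Evaluate the detection accuracy of RF-DETR and YOLOv12 on a custom greenfruit dataset with both single-class and multi-class labels.
Assess model performance under occlusion, camouflage, and background clutter in real orchard conditions.
Analyze convergence behavior and inference efficiency to guide deployment decisions in precision agriculture.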

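Proposed Method

RF-DETR and YOLOv12 are trained on the same dataset with the same training protocol and number of epochs.
RF-DETR uses a DINOv2 backbone with deformable attention, requires neither anchor boxes nor NMS, and operates on single-scale features.
YOLOv12 uses an R-ELAN backbone with area attention; its multi-task heads support detection, oriented bounding boxes, and instance segmentation.
Input resolution is fixed at 640x640; training runs in FP32 with a batch size of about 16 on an RTX A5000.
Evaluation uses precision, recall, F1, mAP@50, mAP@50:95, and mIoU, along with inference speed.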
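
The review does not include the paper's evaluation code, so the following is only a minimal sketch of how box-level precision, recall, and F1 depend on the IoU threshold. The greedy matching rule, the function names, and the toy boxes are assumptions made for illustration. The threshold sweep averages F1 rather than average precision, so it is a simplified stand-in for mAP@50:95, which integrates a confidence-ranked precision-recall curve per class.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(preds: List[Box], truths: List[Box], thr: float):
    """Greedy one-to-one matching (an assumption of this sketch).

    `preds` is assumed to be sorted by descending confidence.
    """
    unmatched = list(truths)
    tp = 0
    for p in preds:
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched.remove(best)  # each ground-truth box is matched at most once
            tp += 1
    fp, fn = len(preds) - tp, len(unmatched)
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if truths else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def mean_f1_over_thresholds(preds: List[Box], truths: List[Box]) -> float:
    """Average F1 over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [0.50 + 0.05 * i for i in range(10)]
    return sum(precision_recall_f1(preds, truths, t)[2] for t in thresholds) / len(thresholds)


# Toy data (hypothetical, not from the paper): one exact box, one box shifted by 10 px.
truths = [(0, 0, 100, 100), (200, 200, 300, 300)]
preds = [(0, 0, 100, 100), (210, 210, 310, 310)]

print(precision_recall_f1(preds, truths, 0.50))
print(precision_recall_f1(preds, truths, 0.75))
print(mean_f1_over_thresholds(preds, truths))
```

In this toy case the shifted box has an IoU of about 0.68, so it counts as a hit at the 0.50 threshold but as a miss at 0.70 and above. The same effect explains how a detector can lead on mAP@50 yet trail on mAP@50:95: the stricter thresholds reward tighter box localization.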

Figure 1: Classification of object detection methodologies: Top features state-of-the-art CNN-based and Transformer-based methods, widely adopted; Vision Language Models are emerging. Also includes Hybrid, Sparse Coding, and Traditional Feature-based approaches.

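Experimental Results

Research Questions

RQ1: Under label ambiguity, how do RF-DETR and YOLOv12 differ in single-class greenfruit detection?
RQ2: How well does each model distinguish occluded from non-occluded fruit in multi-class detection?
RQ3: How do the convergence dynamics and training efficiency of transformer-based and CNN-based detectors compare in agricultural scenes?
RQ4: What are the relative inference speeds of RF-DETR and YOLOv12, and how suitable is each for edge deployment?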

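Key Findings

RF-DETR reaches the highest single-class mAP@50, at 0.9464.
YOLOv12N achieves the highest single-class mAP@50:95, at 0.7620.
In multi-class detection, RF-DETR leads with mAP@50 = 0.8298.
YOLOv12L leads multi-class mAP@50:95, at 0.6622.
RF-DETR converges quickly, plateauing within 10 to 20 epochs (within 10 epochs in the single-class setting, per the abstract), which indicates efficient training dynamics.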
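
The review does not say how the plateau was identified. One common heuristic, sketched below as an assumption rather than the authors' procedure, is an early-stopping-style rule: the curve is considered flat once the validation score has failed to improve by more than `min_delta` for `patience` consecutive epochs. The function name, the parameter defaults, and the synthetic curve are all hypothetical.

```python
from typing import Optional, Sequence


def plateau_epoch(
    scores: Sequence[float], patience: int = 3, min_delta: float = 0.005
) -> Optional[int]:
    """Return the 1-based epoch of the last meaningful improvement.

    An improvement is a score exceeding the previous best by more than
    `min_delta`. Returns None if the curve never stays flat for `patience`
    consecutive epochs.
    """
    best = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best + min_delta:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch
    return None


# Synthetic validation mAP@50 curve for illustration only (not data from the paper).
curve = [0.50, 0.70, 0.82, 0.88, 0.91, 0.93, 0.94, 0.942, 0.943, 0.943, 0.944]
print(plateau_epoch(curve))
```

Applied to per-epoch validation logs from both detectors, a rule like this gives a comparable plateau epoch for each model, though the result depends on the chosen `patience` and `min_delta`.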

Figure 2: CNN vs Transformer-based model performance comparison focusing on YOLOv12 (CNN-based) and RF-DETR (Transformer-based) architectures: (a) RF-DETR object detection model benchmark evaluation with YOLO11, YOLOv8 and other DETR-based object detection models; (b) RF-DETR evaluation on the RF100

This review was generated by AI and reviewed by a human editor.