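[Paper Summary] Zero-Shot Automatic Annotation and Instance Segmentation using LLM-Generated Datasets: Eliminating Field Imaging and Manual Annotation for Deep Learning Model Development
This paper proposes a zero-shot workflow built on LLM-generated synthetic data, combined with SAM and YOLO11, that trains apple instance segmentation models without field data collection or manual annotation and validates them on real orchard images. The automatically generated annotations reach a Dice coefficient of 0.9513 and an IoU of 0.9303, and the YOLO11m-seg model reaches a mask precision of 0.902 and a mask mAP@50 of 0.833 on real orchard test images.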
Currently, deep learning-based instance segmentation for various applications (e.g., Agriculture) is predominantly performed using a labor-intensive process involving extensive field data collection using sophisticated sensors, followed by careful manual annotation of images, presenting significant logistical and financial challenges to researchers and organizations. The process also slows down the model development and training process. In this study, we presented a novel method for deep learning-based instance segmentation of apples in commercial orchards that eliminates the need for labor-intensive field data collection and manual annotation. Utilizing a Large Language Model (LLM), we synthetically generated orchard images and automatically annotated them using the Segment Anything Model (SAM) integrated with a YOLO11 base model. This method significantly reduces reliance on physical sensors and manual data processing, presenting a major advancement in "Agricultural AI". The synthetic, auto-annotated dataset was used to train the YOLO11 model for Apple instance segmentation, which was then validated on real orchard images. The results showed that the automatically generated annotations achieved a Dice Coefficient of 0.9513 and an IoU of 0.9303, validating the accuracy and overlap of the mask annotations. All YOLO11 configurations, trained solely on these synthetic datasets with automated annotations, accurately recognized and delineated apples, highlighting the method's efficacy. Specifically, the YOLO11m-seg configuration achieved a mask precision of 0.902 and a mask mAP@50 of 0.833 on test images collected from a commercial orchard. Additionally, the YOLO11l-seg configuration outperformed other models in validation on 40 LLM-generated images, achieving the highest mask precision and mAP@50 metrics.
Keywords: YOLO, SAM, SAMv2, YOLO11, YOLOv11, Segment Anything, YOLO-SAM
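Research Motivation and Objectives
- Reduce the dependence of agricultural instance segmentation on field data collection and manual annotation.
- Demonstrate zero-shot detection and automatic mask generation on LLM-generated orchard images.
- Train YOLO11 instance segmentation models solely on synthetic, automatically annotated data and validate them on real orchard imagery.
- Evaluate model accuracy and efficiency in a commercial orchard setting using Dice, IoU, precision, and mAP@50.
Proposed Method
- Generate realistic orchard images from text prompts with DALL-E (the LLM-generated data).
- Apply a YOLO11 base model trained on COCO for zero-shot apple detection, producing bounding boxes in the synthetic images.
- Use SAMv2 to automatically generate segmentation masks inside the bounding boxes detected by YOLO.
- Train YOLO11 instance segmentation models (n, s, m, l, and x configurations) solely on the synthetic, automatically annotated data.
- Validate performance on real orchard images captured with a Microsoft Azure Kinect DK mounted on a robotic platform; compare the automatic masks against manual masks using standard segmentation metrics.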
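
The last step of the automatic annotation stage is to store each mask as a training label. The sketch below illustrates one way this could look, assuming the Ultralytics segmentation label format (a class index followed by polygon coordinates normalized to the image size) and assuming the mask outline has already been extracted as a list of pixel points. The function name `polygon_to_yolo_seg_line` and the sample coordinates are hypothetical and are not taken from the paper.

```python
def polygon_to_yolo_seg_line(class_id, polygon_px, img_w, img_h):
    """Format one instance as a YOLO segmentation label line.

    polygon_px: list of (x, y) pixel coordinates outlining the mask.
    Returns "class x1 y1 x2 y2 ..." with coordinates normalized to [0, 1].
    """
    if len(polygon_px) < 3:
        raise ValueError("a polygon needs at least 3 points")
    coords = []
    for x, y in polygon_px:
        # Normalize by image size and clamp points that fall slightly outside.
        coords.append(min(max(x / img_w, 0.0), 1.0))
        coords.append(min(max(y / img_h, 0.0), 1.0))
    return " ".join([str(class_id)] + [f"{c:.6f}" for c in coords])


# Hypothetical triangular outline in a 400x200 image, single class 0 (apple).
line = polygon_to_yolo_seg_line(0, [(100, 50), (200, 50), (150, 150)], 400, 200)
print(line)  # 0 0.250000 0.250000 0.500000 0.250000 0.375000 0.750000
```

One such line per apple, written to a text file that shares the image's base name, is the layout the Ultralytics training tools expect for segmentation datasets.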

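Experimental Results
Research Questions
- RQ1: Can LLM-generated synthetic orchard images support zero-shot detection and accurate apple instance segmentation without field data collection?
- RQ2: How do YOLO11-based instance segmentation models trained on automatically annotated synthetic data perform on real orchard imagery?
- RQ3: When trained on SAMv2 annotations, what are the accuracy and efficiency trade-offs (Dice, IoU, mAP@50, inference speed) across YOLO11 configurations?
Key Findings
- Zero-shot YOLO11 combined with SAMv2 generates automatic apple masks from LLM-generated images, with a Dice coefficient of 0.9513 and an IoU of 0.9303.
- YOLO11m-seg achieves a mask precision of 0.902 and a mask mAP@50 of 0.833 on real orchard test images.
- The YOLO11l-seg configuration delivers the highest mask precision and mAP@50 on 40 LLM-generated images.
- YOLO11n-seg delivers the fastest inference among the tested configurations, at 3.8 ms.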
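
The Dice and IoU figures above compare automatic masks against manual masks. As a reminder of how the two metrics are defined for a single pair of binary masks, here is a minimal pure-Python sketch. The function name `dice_and_iou`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; the paper's own evaluation code is not shown in this summary.

```python
def dice_and_iou(mask_a, mask_b):
    """Compute Dice and IoU for two binary masks given as nested lists of 0/1."""
    inter = area_a = area_b = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            area_a += 1 if a else 0
            area_b += 1 if b else 0
            inter += 1 if (a and b) else 0
    union = area_a + area_b - inter
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0, 1.0
    dice = 2 * inter / (area_a + area_b)
    iou = inter / union
    return dice, iou


# Toy example: two 4-pixel masks that share 2 pixels.
auto_mask = [[1, 1, 0],
             [1, 1, 0]]
manual_mask = [[0, 1, 1],
               [0, 1, 1]]
dice, iou = dice_and_iou(auto_mask, manual_mask)
print(f"Dice={dice:.4f}, IoU={iou:.4f}")  # Dice=0.5000, IoU=0.3333
```

For any single pair of masks the two values are tied by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. Scores averaged over many masks need not satisfy that identity exactly.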

This summary was generated by AI and reviewed by a human editor.