QUICK REVIEW

[Paper Review] ODGEN: Domain-specific Object Detection Data Generation with Diffusion Models

Jingyuan Zhu, Shiyu Li|arXiv (Cornell University)|May 24, 2024

Machine Learning and Data Classification | Cited by 5

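One-sentence summary

ODGEN fine-tunes a diffusion model on domain-specific data and uses object-wise prompts and synthesized object patches to generate high-quality, bounding-box-conditioned images, which improves detector training and outperforms prior controllable generation methods.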

ABSTRACT

Modern diffusion-based image generative models have made significant progress and become promising to enrich training data for the object detection task. However, the generation quality and the controllability for complex scenes containing multi-class objects and dense objects with occlusions remain limited. This paper presents ODGEN, a novel method to generate high-quality images conditioned on bounding boxes, thereby facilitating data synthesis for object detection. Given a domain-specific object detection dataset, we first fine-tune a pre-trained diffusion model on both cropped foreground objects and entire images to fit target distributions. Then we propose to control the diffusion model using synthesized visual prompts with spatial constraints and object-wise textual descriptions. ODGEN exhibits robustness in handling complex scenes and specific domains. Further, we design a dataset synthesis pipeline to evaluate ODGEN on 7 domain-specific benchmarks to demonstrate its effectiveness. Adding training data generated by ODGEN improves up to 25.3% mAP@.50:.95 with object detectors like YOLOv5 and YOLOv7, outperforming prior controllable generative methods. In addition, we design an evaluation protocol based on COCO-2014 to validate ODGEN in general domains and observe an advantage up to 5.6% in mAP@.50:.95 against existing methods.

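Research motivation and goals

Improve data augmentation for domain-specific object detection in data-scarce or specialized settings.
Develop a diffusion-model fine-tuning strategy that combines entire scenes with cropped object patches to better capture domain characteristics.
Introduce object-wise conditioning that uses separate textual and visual prompts to control multiple objects and avoid concept bleeding.
Propose a dataset synthesis pipeline that generates and filters pseudo labels to create effective training data for detectors.
Demonstrate improved detection performance and fidelity on both domain-specific and general benchmarks.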

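Proposed method

Fine-tune a pre-trained diffusion model on both entire images and cropped foreground patches so that its output distribution matches the target domain.
Encode each object's class text separately with a frozen CLIP text encoder to avoid interference, then stack the embeddings and process them with a trainable text embedding encoder.
Generate synthetic foreground patches, paste them onto a blank canvas according to their bounding boxes, and use the result as the visual condition for ControlNet.
Train a foreground/background discriminator to verify that pseudo-labeled regions actually contain synthesized objects, and filter out invalid labels.
Build a dataset synthesis pipeline that estimates object distributions from the training set, samples pseudo labels, synthesizes images, and filters out corrupted labels to improve detector training.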
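The canvas-pasting step in the method list can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`resize_nearest`, `paste_on_canvas`, `build_image_list`), the nearest-neighbour resize, and the choice of one canvas per object are all assumptions made for illustration. Boxes are assumed to be pixel coordinates `(x0, y0, x1, y1)` that lie inside the canvas.

```python
import numpy as np

def resize_nearest(patch, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) array via index lookup."""
    in_h, in_w = patch.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return patch[rows[:, None], cols[None, :]]

def paste_on_canvas(patch, box, canvas_hw):
    """Paste one patch into its box on a blank (all-zero) canvas."""
    x0, y0, x1, y1 = box
    h, w = canvas_hw
    canvas = np.zeros((h, w, patch.shape[2]), dtype=patch.dtype)
    canvas[y0:y1, x0:x1] = resize_nearest(patch, y1 - y0, x1 - x0)
    return canvas

def build_image_list(patches, boxes, canvas_hw):
    """Hypothetical helper: one canvas per object, so that overlapping
    boxes never overwrite each other (an assumption, not a confirmed detail)."""
    return [paste_on_canvas(p, b, canvas_hw) for p, b in zip(patches, boxes)]
```

Keeping one canvas per object means an occluded object is still fully visible in its own canvas. Merging everything into a single canvas would be simpler, but overlapping boxes would then overwrite each other.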

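Experimental results

Research questions

RQ1: How can high-fidelity, bounding-box-conditioned, domain-specific images be generated for object detection?
RQ2: Does object-wise conditioning with separate textual and visual prompts reduce concept bleeding and improve multi-object scene synthesis?
RQ3: How does fine-tuning the diffusion model on both entire scenes and foreground crops affect detection performance?
RQ4: Does synthetic data generated by ODGEN improve detection performance more than prior controllable generation methods, in both domain-specific and general domains?
RQ5: When detectors are used to evaluate synthetic data, which evaluation protocol best validates fidelity (FID) and trainability (mAP)?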

Key findings

Dataset	Baseline (mAP, YOLOv5s / YOLOv7)	ReCo (mAP)	GLIGEN (mAP)	ControlNet (mAP)	GeoDiffusion (mAP)	ODGEN (mAP)
Apex Game	38.3 / 47.2	25.0 / 31.5	24.8 / 32.5	33.8 / 42.7	29.2 / 35.8	39.9 / 52.6
Robomaster	27.2 / 26.5	18.2 / 27.9	19.1 / 25.0	24.4 / 32.9	18.2 / 22.6	39.6 / 34.7
MRI Image	37.6 / 27.4	42.7 / 38.3	32.3 / 25.9	44.7 / 37.2	42.0 / 38.9	46.1 / 41.5
Cotton	16.7 / 20.5	29.3 / 37.5	28.0 / 39.0	22.6 / 35.1	30.2 / 36.0	42.0 / 43.2
Road Traffic	35.3 / 41.0	22.8 / 29.3	22.2 / 29.5	22.1 / 30.5	17.2 / 29.4	39.2 / 43.8
Aquarium	30.0 / 29.6	23.8 / 34.3	24.1 / 32.2	18.2 / 25.6	21.6 / 30.9	32.2 / 38.5
Underwater	16.7 / 19.4	13.7 / 15.8	14.9 / 18.5	15.5 / 17.8	13.8 / 17.2	19.2 / 22.0
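As a quick consistency check on the table, the gain of ODGEN over the baseline can be computed per dataset and detector. The values below are copied from the Baseline and ODGEN columns. The variable names are illustrative only.

```python
# (YOLOv5s, YOLOv7) mAP pairs copied from the table above.
baseline = {
    "Apex Game": (38.3, 47.2), "Robomaster": (27.2, 26.5),
    "MRI Image": (37.6, 27.4), "Cotton": (16.7, 20.5),
    "Road Traffic": (35.3, 41.0), "Aquarium": (30.0, 29.6),
    "Underwater": (16.7, 19.4),
}
odgen = {
    "Apex Game": (39.9, 52.6), "Robomaster": (39.6, 34.7),
    "MRI Image": (46.1, 41.5), "Cotton": (42.0, 43.2),
    "Road Traffic": (39.2, 43.8), "Aquarium": (32.2, 38.5),
    "Underwater": (19.2, 22.0),
}

detectors = ("YOLOv5s", "YOLOv7")
# Gain in mAP points for every (dataset, detector) pair.
gains = {
    (name, det): round(odgen[name][i] - baseline[name][i], 1)
    for name in baseline
    for i, det in enumerate(detectors)
}

best = max(gains, key=gains.get)
print(best, gains[best])  # -> ('Cotton', 'YOLOv5s') 25.3
```

The largest gain, 25.3 points on Cotton with YOLOv5s, matches the headline figure quoted in the abstract, and every dataset/detector pair shows a positive gain over the baseline.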

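ODGEN achieves lower FID than prior controllable generation methods on the seven domain-specific datasets.
Adding ODGEN synthetic data raises YOLOv5s/YOLOv7 mAP@.50:.95 on the RF7 datasets by up to 25.3 points (Cotton with YOLOv5s, 16.7 to 42.0).
In the COCO-based general-domain evaluation, ODGEN shows an advantage of up to 5.6% in mAP@.50:.95 over existing methods.
Object-wise text lists and image lists help mitigate interference between objects and occlusion problems, improving fidelity and layout accuracy.
Foreground region re-weighting (gamma) and corrupted-label filtering both help improve fidelity and detection performance.
On the COCO and RF7 benchmarks, ODGEN outperforms ReCo, GLIGEN, ControlNet, GeoDiffusion, and MIGC in both fidelity (FID) and trainability (mAP).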
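The review mentions foreground re-weighting with a factor gamma but does not give the loss itself. The sketch below shows one plausible reading, assumed here rather than taken from the paper: a mean squared error in which pixels inside the object boxes are weighted by `gamma` and background pixels by 1. The function name `reweighted_mse` and the boolean `fg_mask` argument are hypothetical.

```python
import numpy as np

def reweighted_mse(pred, target, fg_mask, gamma):
    """MSE with foreground pixels weighted by gamma and background by 1.

    Assumed form for illustration only; the paper's exact loss may differ.
    """
    weights = np.where(fg_mask, gamma, 1.0)
    return float(np.mean(weights * (pred - target) ** 2))

# Toy example: a 4x4 map with a 2x2 foreground region.
pred = np.zeros((4, 4))
target = np.ones((4, 4))
fg_mask = np.zeros((4, 4), dtype=bool)
fg_mask[1:3, 1:3] = True

plain = reweighted_mse(pred, target, fg_mask, gamma=1.0)
boosted = reweighted_mse(pred, target, fg_mask, gamma=3.0)
```

With `gamma` above 1, errors on foreground pixels contribute more to the loss, which pushes the model to reconstruct small or sparse objects more carefully than a uniform loss would.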

This review was generated by AI and reviewed by a human editor.