QUICK REVIEW

[Paper Review] Described Object Detection: Liberating Object Detection with Flexible Expressions

Chi Xie, Zhao Zhang|arXiv (Cornell University)|Jul 24, 2023

Multimodal Machine Learning Applications | Cited by 9

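One-Sentence Summary

This paper introduces Described Object Detection (DOD) and the D3 dataset for evaluating the detection of objects described by flexible language expressions; it analyzes current SOTA methods and proposes OFA-DOD as a stronger baseline for handling presence/absence descriptions and multi-instance descriptions.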

ABSTRACT

Detecting objects based on language information is a popular task that includes Open-Vocabulary object Detection (OVD) and Referring Expression Comprehension (REC). In this paper, we advance them to a more practical setting called Described Object Detection (DOD) by expanding category names to flexible language expressions for OVD and overcoming the limitation of REC only grounding the pre-existing object. We establish the research foundation for DOD by constructing a Description Detection Dataset ($D^3$). This dataset features flexible language expressions, whether short category names or long descriptions, and annotating all described objects on all images without omission. By evaluating previous SOTA methods on $D^3$, we find some troublemakers that fail current REC, OVD, and bi-functional methods. REC methods struggle with confidence scores, rejecting negative instances, and multi-target scenarios, while OVD methods face constraints with long and complex descriptions. Recent bi-functional methods also do not work well on DOD due to their separated training procedures and inference strategies for REC and OVD tasks. Building upon the aforementioned findings, we propose a baseline that largely improves REC methods by reconstructing the training data and introducing a binary classification sub-task, outperforming existing methods. Data and code are available at https://github.com/shikras/d-cube and related works are tracked in https://github.com/Charles-Xie/awesome-described-object-detection.

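Motivation and Objectives

Propose a practical detection setting that uses flexible language expressions, going beyond short category names and the assumptions of existing REC.
Create and release the Description Detection Dataset (D3), in which every description, including absence expressions, is annotated exhaustively across all images.
Systematically evaluate existing OVD, REC, and bi-functional methods on D3 to expose their limitations in the DOD setting.
Propose a robust baseline (OFA-DOD) that improves REC methods through data reconstruction and a binary relevance task, so that negatives are rejected more reliably and multiple targets are handled.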
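
The last objective combines a binary relevance decision with multi-target localization. A minimal sketch of how such a gate could sit in front of a grounding model is shown below. It is an illustration only, not the OFA-DOD implementation: the `Candidate` type, the `detect_described` function, and both thresholds are hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical format


@dataclass
class Candidate:
    box: Box
    score: float  # localization confidence from some REC-style model (assumed)


def detect_described(
    candidates: List[Candidate],
    relevance: float,
    relevance_threshold: float = 0.5,  # hypothetical value
    score_threshold: float = 0.3,      # hypothetical value
) -> List[Candidate]:
    """Return zero, one, or many boxes for one (image, description) pair."""
    # Step 1: binary gate. If the description is judged not to apply to the
    # image, reject every candidate instead of forcing a single output box.
    if relevance < relevance_threshold:
        return []
    # Step 2: keep every sufficiently confident box, so that a description
    # matching several instances yields several detections.
    return [c for c in candidates if c.score >= score_threshold]
```

With a low relevance score the function returns an empty list, which is the behavior a classic REC model lacks because it always grounds exactly one object.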

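Proposed Method

Build and annotate D3, a detection-style benchmark with complete annotations, unrestricted language expressions, and absence descriptions.
Evaluate SOTA methods from the OVD, REC, and bi-functional families on D3 to establish baselines.
Propose and implement OFA-DOD, a modification of the OFA baseline that adds granularity decomposition, reconstructed training data for REC, and a task decomposition that enables binary rejection of negative samples.
Run ablation experiments to quantify the contributions of granularity decomposition, reconstructed data, task decomposition, and the choice of training data.
Evaluate with multi-label mean average precision (mAP) in intra-scenario and inter-scenario settings, under the FULL, PRES, and ABS evaluation modes.
Analyze how performance varies with the number of instances per image (none, one, or several) and with the length of the referring expression (short to very long).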
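
The evaluation step can be sketched as follows, assuming FULL averages over all descriptions, PRES over presence descriptions only, and ABS over absence descriptions only. The AP below is a simplified non-interpolated variant that takes detections already matched to ground truth; it is not the benchmark's official evaluation code, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def average_precision(detections: List[Tuple[float, bool]], num_gt: int) -> float:
    """Non-interpolated AP for one description.

    detections: (confidence score, is_true_positive) pairs, where matching
    against ground-truth boxes is assumed to have been done already.
    """
    if num_gt == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
            # Precision measured at each rank where a true positive appears.
            precision_sum += true_positives / rank
    return precision_sum / num_gt


def map_by_mode(ap: Dict[str, float], is_absence: Dict[str, bool]) -> Dict[str, float]:
    """Average per-description AP over the FULL, PRES, and ABS subsets."""
    def mean(values: List[float]) -> float:
        return sum(values) / len(values) if values else float("nan")

    return {
        "FULL": mean(list(ap.values())),
        "PRES": mean([v for d, v in ap.items() if not is_absence[d]]),
        "ABS": mean([v for d, v in ap.items() if is_absence[d]]),
    }
```

Because AP depends on the ranking induced by confidence scores, a model that assigns high scores to boxes for descriptions that do not apply to an image is penalized directly, which is why unreliable confidence hurts REC methods under this metric.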

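Experimental Results

Research Questions

RQ1: How do existing OVD, REC, and bi-functional methods perform in the Described Object Detection (DOD) setting introduced with D3?
RQ2: What are the key failure modes of current methods when objects are described by flexible language expressions, including absence descriptions?
RQ3: Does the modified OFA-based baseline (OFA-DOD) improve localization, multi-target handling, and the rejection of negative references?
RQ4: How do presence and absence descriptions affect detection performance and confidence calibration?
RQ5: How do description length and the number of instances per image affect method performance?

Main Findings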

Task	Method	FULL mAP	PRES mAP	ABS mAP	Inter-scenario FULL mAP	Inter-scenario PRES mAP	Inter-scenario ABS mAP
REC	OFA_base	3.4	3.0	4.3	0.1	0.1	0.1
REC	OFA_large	4.2	4.1	4.6	0.1	0.1	0.1
OVD	CORA_R50	6.2	6.7	5.0	2.0	2.2	1.3
OVD	OWL-ViT_base	8.6	8.5	8.8	3.2	3.7	4.7
OVD	OWL-ViT_large	9.6	10.7	6.4	2.5	2.9	2.1
Bi-functional	UNINEXT_large	17.9	18.6	15.9	2.9	3.1	2.5
Bi-functional	UNINEXT_huge	20.0	20.6	18.1	3.3	3.9	1.6
Bi-functional	G-DINO_tiny	19.2	18.5	21.2	2.3	2.5	2.1
Bi-functional	G-DINO_base	20.7	20.1	22.5	2.7	2.4	3.5
DOD	OFA-DOD_base	21.6	23.7	15.4	5.7	6.9	2.3
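
As a reading aid, the gap between presence and absence performance can be computed directly from the intra-scenario columns of the table above. The snippet copies a subset of rows verbatim; the variable names are chosen for this example.

```python
# (FULL, PRES, ABS) intra-scenario mAP, copied from the table above.
intra = {
    "OFA_large": (4.2, 4.1, 4.6),
    "OWL-ViT_large": (9.6, 10.7, 6.4),
    "UNINEXT_huge": (20.0, 20.6, 18.1),
    "G-DINO_base": (20.7, 20.1, 22.5),
    "OFA-DOD_base": (21.6, 23.7, 15.4),
}

# Positive gap: the method does better on presence than on absence descriptions.
pres_abs_gap = {
    name: round(pres - abs_, 1) for name, (_, pres, abs_) in intra.items()
}

# Best method per column among the rows listed here.
best_full = max(intra, key=lambda name: intra[name][0])
best_abs = max(intra, key=lambda name: intra[name][2])
```

Among these rows, OFA-DOD_base has the highest FULL mAP but also the largest gap between PRES and ABS, while G-DINO_base has the highest ABS mAP.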

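Existing REC methods perform poorly on D3: they lack reliable confidence estimates and negative-sample rejection, especially in multi-target scenarios.
OVD methods outperform REC methods on D3 but struggle with long or complex descriptions.
Bi-functional methods outperform the REC and OVD baselines but still struggle with inter-scenario evaluation and negative-sample rejection.
The proposed OFA-DOD baseline substantially improves on REC methods on D3 (FULL mAP 21.6 versus 4.2 for OFA_large) and handles multiple targets and negative rejection better, although it is not yet state of the art on every metric: its ABS mAP of 15.4 trails the 22.5 of G-DINO_base.
Ablations show that granularity decomposition, reconstructed REC data, and task decomposition each improve performance; the choice of multi-task training data (detection, image-text, MLM) affects results, and MLM contributes less than expected in some settings.
Presence descriptions are comparatively easier for most methods; under absence descriptions, the confidence scores of REC methods are unreliable; OFA-DOD's scores separate true positives from false positives more clearly.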


This review was generated by AI and checked by a human editor.