QUICK REVIEW

[Paper Review] Scaling Open-Vocabulary Object Detection

Matthias Minderer, Alexey A. Gritsenko|arXiv (Cornell University)|Jun 16, 2023

Multimodal Machine Learning Applications|Cited by 28

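One-Sentence Summary

This paper introduces OWLv2 and OWL-ST, which scale open-vocabulary object detection through Web-scale self-training with minimal filtering and achieve state-of-the-art results on LVIS rare classes, for which the model has seen no human box annotations: up to 44.6% AP with L/14 and 47.2% with ViT-G/14.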

ABSTRACT

Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which address these challenges. OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples, yielding further large improvement: With an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.

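Motivation and Objectives

Scale open-vocabulary detection by exploiting the abundant weak supervision available in Web data.
Develop a scalable self-training recipe (OWL-ST) that relies on minimal filtering and a data-centric label space.
Improve training efficiency to maximize the number of images seen per unit of compute (token dropping, instance selection, mosaics).
Evaluate open-vocabulary detection on LVIS, ODinW, and in-the-wild datasets to measure generalization and the effects of fine-tuning.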

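Proposed Method

Use WebLI (10B image-text pairs) as the source of weak supervision for pseudo-annotation.
Try two label-space strategies: a fixed, human-curated vocabulary and machine-generated N-grams drawn from the text paired with each image.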
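Apply minimal filtering based on confidence thresholds: keep an image only if at least one of its pseudo-annotations scores above 0.3, and within each kept image retain every pseudo-annotation that scores at least 0.1.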
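Initialize the detector from a CLIP/SigLIP vision-language backbone with OWL-ViT-style detection heads; self-train on the pseudo-annotations, then optionally fine-tune on LVIS base.
Improve training efficiency through token dropping by patch variance (dropping about 50% of tokens), an objectness head that selects roughly the top 10% of tokens, mosaics (grids of up to 6x6) that increase the effective number of examples per batch, and other practices from large-scale Transformer training.
The OWLv2 model variant cuts FLOPs per example by about 50% and roughly doubles throughput relative to OWL-ViT; inference uses the same backbone and detection heads as training.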
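
The two-threshold filtering rule in the method above is simple enough to sketch directly. The snippet below is a minimal illustration in plain Python, not the authors' implementation: the `PseudoBox` container, the function name, and the constant names are hypothetical, and only the two threshold values (0.3 and 0.1) come from the summary.

```python
from dataclasses import dataclass

# Thresholds quoted in the method summary; the constant names are illustrative.
IMAGE_KEEP_THRESHOLD = 0.3  # an image needs at least one box scoring above this
BOX_KEEP_THRESHOLD = 0.1    # boxes scoring below this are dropped from kept images


@dataclass
class PseudoBox:
    """Hypothetical container for one pseudo-annotation."""
    label: str
    score: float
    box: tuple  # (x_min, y_min, x_max, y_max), assumed normalized to [0, 1]


def filter_pseudo_annotations(boxes):
    """Return the boxes to train on, or None if the image should be discarded."""
    # Image-level filter: discard images with no confident detection at all.
    if not any(b.score > IMAGE_KEEP_THRESHOLD for b in boxes):
        return None
    # Box-level filter: within a kept image, retain everything above the low bar.
    return [b for b in boxes if b.score >= BOX_KEEP_THRESHOLD]
```

The design point is that the image-level threshold removes images with no usable signal, while the lower box-level threshold keeps weak detections in the images that survive, which matches the "minimal filtering" emphasis of the recipe.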

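Experimental Results

Research Questions

RQ1: How far can open-vocabulary object detection scale with Web-scale weak supervision and no human box annotations?
RQ2: How does label-space design (human-curated vs. machine-generated vs. mixed) affect generalization to unseen classes and in-the-wild datasets?
RQ3: At large scale, which pseudo-annotation filtering strategy gives the best trade-off between bias and variance?
RQ4: How do the efficiency optimizations (token dropping, instance selection, mosaics) affect detection accuracy at large scale?
RQ5: How does fine-tuning affect open-vocabulary performance and robustness to distribution shift, and can ensembling mitigate the trade-off?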
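
For readers unfamiliar with the efficiency optimizations that RQ4 refers to, the following is a rough plain-Python sketch of two of them: dropping the lowest-variance image patches and keeping only the highest-objectness tokens. It is an assumption-laden illustration, not the paper's code: the function names are hypothetical, patches are represented as flat lists of pixel values, and the default fractions mirror the approximate 50% and 10% figures given in the method summary.

```python
from statistics import pvariance


def drop_low_variance_patches(patches, keep_fraction=0.5):
    """Return indices of the highest-variance patches, in original order.

    `patches` is a list of flat pixel-value lists, one per image patch.
    Low-variance patches (e.g. uniform padding or sky) are the ones dropped.
    """
    num_keep = max(1, int(len(patches) * keep_fraction))
    ranked = sorted(
        range(len(patches)),
        key=lambda i: pvariance(patches[i]),
        reverse=True,
    )
    return sorted(ranked[:num_keep])


def select_top_instances(objectness, keep_fraction=0.1):
    """Return indices of the tokens with the highest objectness scores."""
    num_keep = max(1, int(len(objectness) * keep_fraction))
    ranked = sorted(
        range(len(objectness)),
        key=lambda i: objectness[i],
        reverse=True,
    )
    return sorted(ranked[:num_keep])
```

Both helpers return sorted indices rather than values, so a caller can gather the corresponding tokens while preserving their original spatial order.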

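Main Findings

On WebLI data, OWL-ST with machine-generated N-gram prompts achieves strong open-vocabulary performance even without any human box annotations.
With fine-tuning on LVIS base, OWL-ST+FT reaches 47.2% mAP on LVIS rare classes with ViT-G/14 and 44.6% with ViT-L/14, a large improvement over prior methods on unseen classes.
Scaling up self-training yields significant gains even at moderate compute budgets and follows scaling trends similar to those of image-level models, with larger models benefiting more from additional data.
A purely machine-generated label space (N-grams) generalizes better to unseen classes and in-the-wild data than a fixed human-curated vocabulary, and a mixed label space performs well across a range of settings.
Fine-tuning improves performance on the target dataset but can reduce open-world generalization; weight-space ensembling of the checkpoints from before and after fine-tuning mitigates this.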
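
Weight-space ensembling of the checkpoints before and after fine-tuning is commonly done by linearly interpolating the two sets of parameters. The sketch below assumes that simple form; the function name, the `alpha` mixing coefficient, and the representation of a checkpoint as a dict of flat float lists are all hypothetical choices made for illustration.

```python
def interpolate_weights(pretrained, finetuned, alpha=0.5):
    """Blend two checkpoints with identical parameter names and sizes.

    alpha=0.0 returns the self-trained weights, alpha=1.0 the fine-tuned ones.
    Each checkpoint is assumed to be a dict mapping names to flat float lists.
    """
    if pretrained.keys() != finetuned.keys():
        raise ValueError("checkpoints must share the same parameter names")
    blended = {}
    for name, before in pretrained.items():
        after = finetuned[name]
        if len(before) != len(after):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise linear interpolation between the two checkpoints.
        blended[name] = [(1.0 - alpha) * b + alpha * a for b, a in zip(before, after)]
    return blended
```

Sweeping `alpha` between 0 and 1 would trace out the trade-off between target-dataset accuracy and open-world generalization described above; which value works best is an empirical question that this sketch does not answer.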


This review was generated by AI and reviewed by a human editor.