QUICK REVIEW

[Paper Review] The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale

Alina Kuznetsova, Hassan Rom|arXiv (Cornell University)|Nov 2, 2018

Multimodal Machine Learning Applications | References: 57 | Cited by: 613

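One-Sentence Summary

Open Images V4 provides a unified, large-scale dataset of 9.2M images with 30.1M image-level labels covering 19.8k concepts, 15.4M bounding boxes covering 600 object classes, and 375k visual relationship annotations involving 57 classes.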

ABSTRACT

We present Open Images V4, a dataset of 9.2M images with unified annotations for image classification, object detection and visual relationship detection. The images have a Creative Commons Attribution license that allows to share and adapt the material, and they have been collected from Flickr without a predefined list of class names or tags, leading to natural class statistics and avoiding an initial design bias. Open Images V4 offers large scale across several dimensions: 30.1M image-level labels for 19.8k concepts, 15.4M bounding boxes for 600 object classes, and 375k visual relationship annotations involving 57 classes. For object detection in particular, we provide 15x more bounding boxes than the next largest datasets (15.4M boxes on 1.9M images). The images often show complex scenes with several objects (8 annotated objects per image on average). We annotated visual relationships between them, which support visual relationship detection, an emerging task that requires structured reasoning. We provide in-depth comprehensive statistics about the dataset, we validate the quality of the annotations, we study how the performance of several modern models evolves with increasing amounts of training data, and we demonstrate two applications made possible by having unified annotations of multiple types coexisting in the same images. We hope that the scale, quality, and variety of Open Images V4 will foster further research and innovation even beyond the areas of image classification, object detection, and visual relationship detection.

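Motivation and Goals

Provide a large-scale, CC-BY-licensed dataset collected from Flickr without a predefined list of class names, to reduce design bias and enable cross-task research.
Provide unified annotations for image classification, object detection, and visual relationship detection on the same images.
Provide extensive statistical analysis, annotation quality validation, and baseline experiments on how model performance changes as training data grows.
Demonstrate applications enabled by unified annotations, including fine-grained detection and zero-shot visual relationship detection.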

Proposed Method

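Collect about 9.2M Flickr images with a CC-BY license and filter them to reduce privacy and bias concerns, including removing near-duplicates and excluding images that also appear elsewhere on the web.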
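Define 19,794 image-level concepts for annotation and 600 boxable object classes (organized in a hierarchy).
Annotate image-level labels through a computer-assisted workflow that combines multiple image classifiers with human verification.
Annotate 15.4M bounding boxes for the 600 object classes using extreme clicking and box verification series, including hierarchical de-duplication and attribute annotation.
Annotate 374.8k visual relationship triplets by selecting object pairs likely to be in a relationship and verifying them, including non-trivial relationships that go beyond mere co-occurrence.
Provide a data collection and annotation pipeline suited to cross-task training and analysis (classification, detection, and visual relationships).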
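
The extreme clicking step mentioned above can be illustrated with a minimal sketch. In extreme clicking, the annotator marks the left-most, top-most, right-most, and bottom-most points of an object, and the box is taken as the tightest axis-aligned rectangle around those clicks. The names `Box` and `box_from_extreme_points` and the sample coordinates are hypothetical and do not come from the paper or from any Open Images tooling.

```python
from typing import NamedTuple, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


class Box(NamedTuple):
    """Axis-aligned bounding box (hypothetical helper type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def box_from_extreme_points(points: Sequence[Point]) -> Box:
    """Return the tightest axis-aligned box enclosing four extreme clicks."""
    if len(points) != 4:
        raise ValueError("expected exactly four extreme points")
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return Box(min(xs), min(ys), max(xs), max(ys))


# Illustrative clicks: left-most, top-most, right-most, bottom-most.
clicks = [(12.0, 40.0), (55.0, 8.0), (97.0, 35.0), (60.0, 80.0)]
box = box_from_extreme_points(clicks)
print(box)
```

Because the box is derived from the minimum and maximum coordinates, the order in which the four points are clicked does not matter in this sketch.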

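Experimental Results

Research Questions

RQ1: How can large-scale unified annotations spanning classification, detection, and visual relationship tasks be collected and validated in a single dataset?
RQ2: How do the statistics, quality characteristics, and biases of Open Images V4 compare with those of earlier datasets?
RQ3: At this scale, how does the performance of modern models evolve as the amount of training data increases?
RQ4: Which new cross-task applications do unified annotations enable (for example, fine-grained detection without explicit box labels, or zero-shot relationship detection)?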

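Key Findings

Open Images V4 contains 9.18M images, 30.11M image-level labels covering 19,794 concepts, 15.44M bounding boxes covering 600 object classes, and 374.77k visual relationship triplets involving 57 classes.
On average, an image contains 8 annotated objects, and the dataset has 15x more bounding boxes than the next largest datasets (15.4M boxes on 1.9M images).
The dataset emphasizes complex scenes and CC-BY licensing, which permits broad use, including commercial use, while enabling cross-task research under unified annotations.
Quality validation examines the geometric accuracy of the bounding boxes and the recall of the annotations; baseline models show how performance changes as training data grows.
Two applications enabled by unified annotations are demonstrated: fine-grained object detection without fine-grained box labels, and zero-shot visual relationship detection.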
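
As a quick consistency check on the figures above, the reported average of about 8 annotated objects per image can be recovered from the rounded totals (15.4M boxes on 1.9M images). This is only a back-of-the-envelope sketch; the function name `per_image_average` is a hypothetical helper, and the inputs are the rounded numbers quoted in this review, not exact dataset counts.

```python
def per_image_average(total_annotations: float, num_images: float) -> float:
    """Average number of annotations per image."""
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    return total_annotations / num_images


# Rounded totals quoted in the review: 15.4M boxes on 1.9M images.
boxes_per_image = per_image_average(15.4e6, 1.9e6)
print(f"{boxes_per_image:.1f} boxes per boxed image")
```

The result rounds to roughly 8 boxes per image, in line with the stated average.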

This review was generated by AI and reviewed by human editors.