QUICK REVIEW

[Paper Review] The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale

Alina Kuznetsova, Hassan Rom | arXiv (Cornell University) | 2018-11-02

Multimodal Machine Learning Applications | References: 57 | Citations: 613

One-Line Summary

Open Images V4 is a large-scale, unified dataset of 9.2M images with 30.1M image-level labels for 19.8k concepts, 15.4M bounding boxes for 600 object classes, and 375k visual relationship annotations involving 57 classes.

ABSTRACT

We present Open Images V4, a dataset of 9.2M images with unified annotations for image classification, object detection and visual relationship detection. The images have a Creative Commons Attribution license that allows to share and adapt the material, and they have been collected from Flickr without a predefined list of class names or tags, leading to natural class statistics and avoiding an initial design bias. Open Images V4 offers large scale across several dimensions: 30.1M image-level labels for 19.8k concepts, 15.4M bounding boxes for 600 object classes, and 375k visual relationship annotations involving 57 classes. For object detection in particular, we provide 15x more bounding boxes than the next largest datasets (15.4M boxes on 1.9M images). The images often show complex scenes with several objects (8 annotated objects per image on average). We annotated visual relationships between them, which support visual relationship detection, an emerging task that requires structured reasoning. We provide in-depth comprehensive statistics about the dataset, we validate the quality of the annotations, we study how the performance of several modern models evolves with increasing amounts of training data, and we demonstrate two applications made possible by having unified annotations of multiple types coexisting in the same images. We hope that the scale, quality, and variety of Open Images V4 will foster further research and innovation even beyond the areas of image classification, object detection, and visual relationship detection.
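As a quick sanity check, the "8 annotated objects per image" figure can be reproduced from the other numbers in the abstract. This is a sketch that assumes the average is simply the total box count divided by the number of box-annotated images; the inputs are the rounded values quoted above, so the result is approximate.

```python
# Rounded figures quoted in the abstract.
total_boxes = 15.4e6        # bounding boxes
images_with_boxes = 1.9e6   # images that carry box annotations

# Assumption: the reported average is total boxes / annotated images.
boxes_per_image = total_boxes / images_with_boxes
print(f"{boxes_per_image:.1f} boxes per annotated image")  # prints "8.1 boxes per annotated image"
```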

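Motivation and Goals

Provide a large-scale dataset of CC-BY-licensed images collected from Flickr without a predefined list of class names, which reduces design bias and enables cross-task research.
Provide unified annotations for image classification, object detection, and visual relationship detection on the same images.
Provide comprehensive dataset statistics, validate annotation quality, and establish baselines for how model performance changes as the amount of training data grows.
Demonstrate applications made possible by unified annotations, including fine-grained detection and zero-shot visual relationship detection.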

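Proposed Method

Collect ~9.2M CC-BY-licensed Flickr images, filtering them to reduce privacy and bias concerns, including removing duplicates and images that also appear elsewhere on the web.
Define 19,794 image-level concepts for annotation and 600 boxable object classes organized in a hierarchy.
Annotate image-level labels through a computer-assisted workflow that combines multiple image classifiers with human verification.
Annotate 15.4M bounding boxes for the 600 object classes using extreme clicking and box verification series, including hierarchical de-duplication and attribute tagging.
Annotate 374.8k visual relationship triplets by selecting object pairs that could plausibly be in a relationship and verifying them, including non-trivial relationships that go beyond mere co-occurrence.
Provide a data collection and annotation pipeline suited to cross-task learning and analysis across classification, detection, and visual relationships.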
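To make the extreme-clicking step concrete: rather than dragging a box, the annotator clicks points on the object's extremes, and the box is derived from those points. The sketch below assumes four clicks (top-most, bottom-most, left-most, right-most) in pixel coordinates. The `Point`, `Box`, and `box_from_extreme_clicks` names are hypothetical, introduced only for illustration, and are not taken from the paper or any Open Images tooling.

```python
from typing import NamedTuple, Sequence


class Point(NamedTuple):
    x: float
    y: float


class Box(NamedTuple):
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def box_from_extreme_clicks(clicks: Sequence[Point]) -> Box:
    """Return the tightest axis-aligned box that contains the four clicks."""
    if len(clicks) != 4:
        raise ValueError("expected exactly four extreme clicks")
    xs = [p.x for p in clicks]
    ys = [p.y for p in clicks]
    return Box(min(xs), min(ys), max(xs), max(ys))


# Hypothetical clicks: top-most, bottom-most, left-most, right-most.
clicks = [Point(40, 10), Point(55, 90), Point(12, 50), Point(98, 47)]
box = box_from_extreme_clicks(clicks)
print(box)
```

Because the box is just the min and max over the clicked coordinates, the order in which the four points are clicked does not matter in this sketch.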

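Experimental Results

Research Questions

RQ1: How can large-scale, unified annotations spanning classification, detection, and visual relationship tasks be collected and validated within a single dataset?
RQ2: What are the statistics, quality characteristics, and biases of Open Images V4 compared with earlier datasets?
RQ3: How does the performance of modern models change as the amount of training data grows at this scale?
RQ4: Which new cross-task applications do unified annotations make feasible (for example, fine-grained detection without explicit box labels, or zero-shot relationship detection)?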
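One common reading of "zero-shot" in the relationship setting is that a test triplet's exact (subject, predicate, object) combination never occurs in the training set, even if each part does on its own. The review does not spell out the paper's protocol, so the sketch below is an assumption. The `Triplet` type, the `zero_shot_triplets` helper, and the example labels are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Triplet(NamedTuple):
    subject: str
    predicate: str
    object: str


def zero_shot_triplets(train: Iterable[Triplet], test: Iterable[Triplet]) -> List[Triplet]:
    """Keep the test triplets whose exact combination is absent from training."""
    seen = set(train)
    return [t for t in test if t not in seen]


# Hypothetical labels, for illustration only.
train = [Triplet("Man", "plays", "Guitar"), Triplet("Woman", "on", "Horse")]
test = [Triplet("Man", "plays", "Guitar"), Triplet("Woman", "plays", "Guitar")]

unseen = zero_shot_triplets(train, test)
print(unseen)
```

Here ("Woman", "plays", "Guitar") counts as zero-shot: "Woman", "plays", and "Guitar" each appear in training, but never together in that combination.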

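Key Results

Open Images V4 contains 9.18M images, 30.11M image-level labels for 19,794 concepts, 15.44M bounding boxes for 600 object classes, and 374.77k visual relationship triplets involving 57 classes.
On average, each image contains 8 annotated objects, and the dataset provides 15x more bounding boxes than the next largest datasets (15.4M boxes on 1.9M images).
The dataset emphasizes complex scenes and carries a CC-BY license, which permits broad use, including in commercial contexts, while its unified annotations enable cross-task research.
Quality validation analyzes the geometric accuracy of the boxes and the recall of the annotations, and model baselines show how performance changes as the amount of training data grows.
Two new applications enabled by unified annotations are demonstrated: fine-grained object detection without fine-grained box labels, and zero-shot visual relationship detection.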
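Geometric box accuracy, as mentioned in the quality validation result above, is commonly quantified with intersection-over-union (IoU) between an annotated box and a reference box. The review does not state which measure the paper used, so treat IoU here as an assumption. The `iou` function and the `(x_min, y_min, x_max, y_max)` box layout are illustrative choices.

```python
from typing import Tuple

BoxTuple = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: BoxTuple, b: BoxTuple) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 10x10 boxes that overlap on half their width: 50 / 150.
print(round(iou((0, 0, 10, 10), (5, 0, 15, 10)), 3))  # prints 0.333
```

An IoU of 1.0 means the two boxes coincide exactly, and 0.0 means they do not overlap at all.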


This review was generated by AI and reviewed by a human editor.