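[Paper Review] Roboflow 100: A Rich, Multi-Domain Object Detection Benchmark
This paper introduces RF100, a multi-domain object detection benchmark of 100 datasets spanning 7 imagery domains. It contains 224,714 images and 805 classes in total, and provides baseline results for evaluating how well models generalize across real-world domains.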
The evaluation of object detection models is usually performed by optimizing a single metric, e.g. mAP, on a fixed set of datasets, e.g. Microsoft COCO and Pascal VOC. Due to image retrieval and annotation costs, these datasets consist largely of images found on the web and do not represent many real-life domains that are being modelled in practice, e.g. satellite, microscopic and gaming, making it difficult to assert the degree of generalization learned by the model. We introduce Roboflow-100 (RF100), consisting of 100 datasets, 7 imagery domains, 224,714 images, and 805 class labels with over 11,170 labelling hours. We derived RF100 from over 90,000 public datasets and 60 million public images that are actively being assembled and labelled by computer vision practitioners in the open on the web application Roboflow Universe. By releasing RF100, we aim to provide a semantically diverse, multi-domain benchmark of datasets to help researchers test their models' generalizability with real-life data. RF100 download and benchmark replication are available on GitHub.
Motivation and Goals
- Provide a semantically diverse, multi-domain benchmark of closed-domain object detection (OD) datasets gathered from Roboflow Universe.
- Enable researchers to test model generalizability and transfer across domain-specific tasks and zero-/few-shot settings.
- Assess how current object detectors perform across varied domains beyond COCO and Pascal VOC.
Proposed Method
- Curate RF100 from 100 datasets across seven semantic categories (Aerial, Videogames, Microscopic, Underwater, Documents, Electromagnetic, Real World).
- Resize all images to 640x640 and clean or merge class labels to reduce ambiguity.
- Report per-category metadata (number of datasets, images, and classes) and domain-specific baselines using YOLOv5, YOLOv7, and GLIP.
- Compare finetuned and zero-shot detector performance across categories, including prompt remapping for GLIP.
- Provide RF100 download and replication resources on GitHub.
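The resizing and label-cleaning steps above can be sketched on box annotations as follows. This is a minimal sketch, not the RF100 pipeline: it assumes a plain stretch resize to 640x640 (the review does not say whether aspect ratio is preserved) and boxes in absolute pixel `(x_min, y_min, x_max, y_max)` coordinates. `LABEL_MAP` and its entries are hypothetical examples of merging duplicate class names.

```python
from dataclasses import dataclass, replace

TARGET_SIZE = 640  # RF100 standardizes images to 640x640


@dataclass(frozen=True)
class Box:
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def resize_boxes(boxes, width, height, target=TARGET_SIZE):
    """Rescale box coordinates for a plain stretch resize (assumed, no letterboxing)."""
    sx, sy = target / width, target / height
    return [
        replace(b, x_min=b.x_min * sx, y_min=b.y_min * sy,
                x_max=b.x_max * sx, y_max=b.y_max * sy)
        for b in boxes
    ]


# Hypothetical merge table: maps duplicate or ambiguous names to one canonical class.
LABEL_MAP = {"Car": "car", "cars": "car", "automobile": "car"}


def merge_labels(boxes, label_map=LABEL_MAP):
    """Replace each label with its canonical name; unknown labels pass through."""
    return [replace(b, label=label_map.get(b.label, b.label)) for b in boxes]


def preprocess(boxes, width, height):
    """Resize boxes to the target resolution, then canonicalize their labels."""
    return merge_labels(resize_boxes(boxes, width, height))


if __name__ == "__main__":
    boxes = [Box("cars", 100, 50, 300, 250), Box("person", 0, 0, 1280, 720)]
    for b in preprocess(boxes, width=1280, height=720):
        print(b)
```

Keeping resizing and label merging as separate functions makes each step easy to swap; a letterbox resize, for example, would also need a padding offset added to every coordinate.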
Experimental Results
Research Questions
- RQ1: Can a diverse, multi-domain benchmark reveal generalization gaps not evident in standard OD benchmarks (e.g., COCO)?
- RQ2: How do finetuned detectors (YOLOv5/YOLOv7) compare with zero-shot detectors (GLIP) across distinct domains?
- RQ3: What are the domain-specific performance patterns and challenges (e.g., object size, class count) across RF100 categories?
- RQ4: To what extent do RF100 datasets cluster semantically, and how does that affect transfer between tasks?
Key Findings
- RF100 comprises 100 datasets, 7 domains, 224,714 images, and 805 classes with over 11,170 labeling hours.
- Model performance varies across categories, indicating domain-dependent generalization gaps.
- Average mAP@.50 for YOLOv5 and YOLOv7 differs by category, with Real World and Videogames often scoring higher than some other domains.
- GLIP, a zero-shot detector, achieves much lower mAP on several RF100 categories, highlighting the limits of open-vocabulary generalization to obscure imagery.
- CLIP embeddings of RF100 datasets cluster by semantic domain, suggesting domain-specific structure within the benchmark.
- The authors provide RF100 for download and replication on GitHub to facilitate further research.
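The mAP@.50 comparisons above can be illustrated with a simplified evaluator. This is a minimal sketch under stated assumptions, not the evaluation code used for RF100: it computes single-class AP at IoU 0.5 with all-point interpolation (Pascal VOC style), averages over classes to get mAP@.50, and then averages per-dataset scores within each category. The YOLOv5/YOLOv7 and COCO-style evaluators differ in interpolation and matching details, so this sketch will not reproduce reported numbers exactly. All function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truths, iou_thresh=0.5):
    """Single-class AP at a fixed IoU threshold with all-point interpolation.

    detections: list of (image_id, score, box)
    ground_truths: dict mapping image_id -> list of boxes
    """
    n_gt = sum(len(v) for v in ground_truths.values())
    if n_gt == 0:
        return 0.0
    matched = {img: [False] * len(v) for img, v in ground_truths.items()}
    precisions, recalls, tp = [], [], 0
    # Greedy matching in descending score order; each ground truth matches at most once.
    ranked = sorted(detections, key=lambda d: d[1], reverse=True)
    for rank, (img, _score, box) in enumerate(ranked, start=1):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(ground_truths.get(img, [])):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= iou_thresh and not matched[img][best_j]:
            matched[img][best_j] = True
            tp += 1  # true positive; anything else counts as a false positive
        precisions.append(tp / rank)
        recalls.append(tp / n_gt)
    # Make precision non-increasing in recall, then integrate the step curve.
    mrec = [0.0] + recalls + [1.0]
    mpre = [0.0] + precisions + [0.0]
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    return sum((mrec[i] - mrec[i - 1]) * mpre[i] for i in range(1, len(mrec)))


def mean_ap(per_class):
    """mAP@.50 for one dataset: per_class maps class -> (detections, ground_truths)."""
    aps = [average_precision(d, g) for d, g in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


def category_mean(results):
    """Average per-dataset mAP@.50 within each category (e.g. "Aerial")."""
    return {cat: sum(v) / len(v) for cat, v in results.items() if v}


if __name__ == "__main__":
    gts = {"img1": [(0, 0, 10, 10), (20, 20, 30, 30)]}
    dets = [("img1", 0.9, (0, 0, 10, 10)),
            ("img1", 0.8, (50, 50, 60, 60)),  # false positive
            ("img1", 0.7, (20, 20, 30, 30))]
    print(average_precision(dets, gts))
```

Averaging first within a dataset and then across datasets in a category keeps large datasets from dominating a category's score, which matches the per-category reporting described above; the exact aggregation RF100 uses should be checked against its replication code.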
This review was generated by AI and reviewed by a human editor.