QUICK REVIEW

[Paper Review] ImageNet Large Scale Visual Recognition Challenge

Olga Russakovsky|arXiv (Cornell University)|Sep 1, 2014

Image Retrieval and Classification Techniques | 53 citations

TL;DR

This paper introduces the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a large-scale benchmark for object classification and detection across 1,000 categories and over a million images. It details the creation of the dataset using crowdsourcing, outlines key algorithmic advances enabled by the scale of the data, and compares state-of-the-art performance with human-level accuracy, highlighting breakthroughs in deep learning and object recognition.

ABSTRACT

The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.

Motivation & Objective

To establish a large-scale, standardized benchmark for object recognition and detection using 1,000 object categories and over a million images.
To address the challenges of collecting and validating accurate image annotations at scale using novel crowdsourcing techniques.
To track and analyze the evolution of object recognition algorithms, particularly deep learning models, over five years of annual competition.
To compare the performance of state-of-the-art computer vision systems with human-level accuracy on image classification and object detection tasks.
To provide insights into statistical properties of object categories and their impact on recognition performance, guiding future algorithm development.

Proposed method

Employed a hybrid crowdsourcing pipeline using Amazon Mechanical Turk and in-house verification to annotate roughly 1.2 million training images with class labels, and a subset of them with bounding boxes.
Implemented a multi-stage annotation process with quality control, including duplicate detection and manual verification of overlapping bounding boxes.
Used a validation set of 50,000 images and a test set of 150,000 images in ILSVRC2010 (100,000 test images from 2012 onward), with test annotations withheld to prevent overfitting.
Developed a standardized competition protocol with a public evaluation server, allowing teams to submit predictions and receive automated feedback.
Applied strict evaluation metrics for object detection, penalizing duplicate detections and requiring precise localization and classification.
Released code for performance evaluation to ensure consistency and reproducibility across submissions.
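
The duplicate penalty in the detection metric can be illustrated with a short sketch. It assumes PASCAL VOC-style greedy matching for a single image and class, with an intersection-over-union (IoU) threshold of 0.5. The function names and the fixed threshold are illustrative assumptions, not the challenge's released evaluation code.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(detections, ground_truth, iou_threshold=0.5):
    """Label each detection as a true or false positive (one image, one class).

    detections: list of (score, box) pairs; ground_truth: list of boxes.
    Returns booleans in descending score order (True = true positive).
    The 0.5 threshold is an assumed default for this sketch.
    """
    claimed = [False] * len(ground_truth)
    results = []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Find the ground-truth box that overlaps this detection the most.
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(ground_truth):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_iou >= iou_threshold and not claimed[best_idx]:
            claimed[best_idx] = True
            results.append(True)
        else:
            # Either poorly localized, or a duplicate of an already-matched object.
            results.append(False)
    return results
```

Under this scheme, a second well-localized box on an object that a higher-scoring detection has already matched counts as a false positive, so emitting many overlapping boxes lowers precision rather than raising recall.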

Experimental results

Research questions

RQ1: How can high-quality image annotations be collected efficiently and accurately at scale?
RQ2: What are the key algorithmic advancements in object recognition enabled by the availability of a large, diverse dataset like ImageNet?
RQ3: How does the performance of state-of-the-art computer vision models compare to human-level accuracy in image classification and object detection?
RQ4: What statistical properties of object categories influence recognition performance, and how can they inform future model design?
RQ5: What are the long-term trends and lessons learned from five years of annual benchmarking in large-scale visual recognition?

Key findings

The ILSVRC dataset, with 1.2 million images and 1,000 object categories, enabled unprecedented progress in object recognition, particularly through deep learning.
The use of crowdsourcing with quality control reduced annotation errors, with only 0.6% of bounding boxes being duplicates and 1% of boxes showing significant overlap, most of which were corrected.
The winning entry's top-5 classification error fell from 28.2% in 2010 to 6.7% in 2014, approaching human-level performance.
Human error on the ImageNet classification task was estimated at around 5.1% top-5 error, with deep convolutional networks rapidly closing the gap.
The detection task remained more challenging: under the PASCAL VOC-style evaluation, the best mean average precision (mAP) rose from 22.6% in 2013 to 43.9% in 2014, still significantly below human performance.
The benchmark facilitated the rise of deep learning in computer vision, with models like AlexNet and GoogLeNet achieving major performance gains on the challenge.
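
The top-5 error rate cited in the findings above counts an image as correct if its true label appears among a model's five most confident guesses. Below is a minimal sketch of that calculation, assuming each prediction is a ranked list of class ids; the function name and input layout are illustrative and not taken from the challenge's evaluation code.

```python
def top_k_error(predictions, labels, k=5):
    """Fraction of images whose true label is not among the first k guesses.

    predictions: one list of class ids per image, most confident first.
    labels: the ground-truth class id for each image.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one image")
    misses = sum(
        1
        for guesses, truth in zip(predictions, labels)
        if truth not in guesses[:k]
    )
    return misses / len(labels)
```

Setting k=1 gives the stricter top-1 error, which is why the two figures should not be compared directly.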

This review was created by AI and reviewed by human editors.