QUICK REVIEW

[Paper Review] Do ImageNet Classifiers Generalize to ImageNet?

Benjamin Recht, Rebecca Roelofs|arXiv (Cornell University)|Feb 13, 2019
Advanced Neural Network Applications|52 references|396 citations
TL;DR

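The paper builds new test sets for CIFAR-10 and ImageNet by closely following the original dataset creation processes. Across a broad range of models, accuracy drops by 3%–15% on CIFAR-10 and 11%–14% on ImageNet, yet original and new accuracies follow a strong linear relationship. The authors attribute the drops to a distribution gap (slightly harder images, with results sensitive to data-cleaning choices) rather than to adaptive overfitting.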

ABSTRACT

We build new test sets for the CIFAR-10 and ImageNet datasets. Both benchmarks have been the focus of intense research for almost a decade, raising the danger of overfitting to excessively re-used test sets. By closely following the original dataset creation processes, we test to what extent current classification models generalize to new data. We evaluate a broad range of models and find accuracy drops of 3% - 15% on CIFAR-10 and 11% - 14% on ImageNet. However, accuracy gains on the original test sets translate to larger gains on the new test sets. Our results suggest that the accuracy drops are not caused by adaptivity, but by the models' inability to generalize to slightly "harder" images than those found in the original test sets.

Motivation & Objective

  • Assess whether image classifiers trained on CIFAR-10 and ImageNet generalize to newly collected test data from the same sources.
  • Quantify the impact of data collection/labeling variations on reported accuracies.
  • Distinguish whether drops are due to adaptivity or distributional shifts in test data.
  • Analyze how model rankings and progress translate under new test sets.
  • Provide reproducible test sets and code to facilitate future generalization studies.

Proposed method

  • Replicate the original test-set creation process for CIFAR-10 and ImageNet to obtain new test sets from the same data sources (Tiny Images for CIFAR-10; Flickr-derived images for ImageNet).
  • Manually filter candidate images to ensure label quality and match original labeling protocols (CIFAR-10 labeling by students; ImageNet MTurk-based labeling).
  • Evaluate a broad range of models spanning a decade of development (from AlexNet to state-of-the-art architectures) on both original and new test sets.
  • Decompose the accuracy gap into adaptivity, distribution, and generalization gaps, and analyze linear relationships between original and new accuracies.
  • Examine how MTurk annotation choices affect ImageNet performance by constructing three variant test sets with different selection-frequency strategies.
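
The accuracy-gap decomposition listed above can be sketched as a telescoping identity. The notation is an assumption made here for illustration: $S$ and $S'$ denote the original and new test sets, $\mathcal{D}$ and $\mathcal{D}'$ the distributions they were sampled from, and $L_X(\hat f)$ the loss of a trained model $\hat f$ measured on $X$.

```latex
\[
L_{S}(\hat f) - L_{S'}(\hat f)
  = \underbrace{L_{S}(\hat f) - L_{\mathcal{D}}(\hat f)}_{\text{adaptivity gap}}
  + \underbrace{L_{\mathcal{D}}(\hat f) - L_{\mathcal{D}'}(\hat f)}_{\text{distribution gap}}
  + \underbrace{L_{\mathcal{D}'}(\hat f) - L_{S'}(\hat f)}_{\text{generalization gap}}
\]
```

Read this way, the adaptivity gap captures how much a model has adapted to the specific original test set, the distribution gap captures the change in the underlying data, and the generalization gap is the sampling error on the new test set, which the model has never influenced.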

Experimental results

Research questions

  • RQ1: How does classifier performance on newly collected test sets compare to performance on the original test sets for CIFAR-10 and ImageNet?
  • RQ2: What portion of the accuracy drop can be attributed to adaptivity (overfitting to the test set) versus distributional shifts in data labeling and collection?
  • RQ3: Do later models retain their relative rankings under new test sets, and is the improvement on original sets predictive of improvement on new sets?
  • RQ4: How sensitive are ImageNet accuracies to MTurk labeling choices and annotation strategies?
  • RQ5: Can the observed accuracy drops be explained by a simple data-difficulty model that preserves model order under distribution shifts?

Key findings

  • Significant accuracy drops for all models when evaluated on new test sets: CIFAR-10 drops 3%–15%; ImageNet drops 11%–14%.
  • On ImageNet, the accuracy drop is roughly equivalent to five years of progress in the research period studied.
  • Model rankings are largely preserved between original and new test sets; higher original accuracy generally predicts higher new accuracy.
  • There is a linear relationship between original and new accuracies, with slopes greater than 1 (1.69 on CIFAR-10, 1.11 on ImageNet), so each percentage point gained on the original test set corresponds to more than one point gained on the new test set.
  • MTurk annotation strategy heavily influences accuracy on ImageNet: the TopImages variant slightly increases accuracy, while MatchedFrequency causes substantial drops, showing that results are brittle to labeling choices.
  • Distribution gap (differences in data collection/labeling) is identified as the primary driver of accuracy declines, more so than adaptive overfitting.
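
The linear-relationship and rank-preservation findings above can be checked with a few lines of code. The following is a minimal sketch, not the authors' analysis code: the function names are hypothetical, and the accuracy values are made up for illustration (they are not results from the paper).

```python
def fit_line(orig_acc, new_acc):
    """Ordinary least-squares fit: new_acc ~ slope * orig_acc + intercept."""
    if len(orig_acc) != len(new_acc) or len(orig_acc) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(orig_acc)
    mean_x = sum(orig_acc) / n
    mean_y = sum(new_acc) / n
    sxx = sum((x - mean_x) ** 2 for x in orig_acc)
    if sxx == 0:
        raise ValueError("original accuracies must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(orig_acc, new_acc))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def same_ranking(orig_acc, new_acc):
    """True if ordering the models by either accuracy gives the same order."""
    idx = range(len(orig_acc))
    by_orig = sorted(idx, key=lambda i: orig_acc[i])
    by_new = sorted(idx, key=lambda i: new_acc[i])
    return by_orig == by_new


# Hypothetical top-1 accuracies (percent) for four models, oldest to newest.
# These numbers are invented for illustration only.
HYPOTHETICAL_ORIG = [60.0, 70.0, 75.0, 80.0]
HYPOTHETICAL_NEW = [46.0, 57.0, 62.5, 68.0]

if __name__ == "__main__":
    slope, intercept = fit_line(HYPOTHETICAL_ORIG, HYPOTHETICAL_NEW)
    print(f"slope={slope:.2f}, intercept={intercept:.2f}")
    print("rankings preserved:", same_ranking(HYPOTHETICAL_ORIG, HYPOTHETICAL_NEW))
```

A slope above 1 in such a fit means the absolute drop shrinks as original accuracy rises, which is the pattern the review describes: every model loses accuracy on the new test set, but better models lose slightly less.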


This review was created by AI and reviewed by human editors.