QUICK REVIEW

[Paper Review] Comparing deep neural networks against humans: object recognition when the signal gets weaker

Robert Geirhos, David Janssen | arXiv (Cornell University) | Jun 21, 2017

Visual Attention and Saliency Detection | 43 references | 154 citations

TL;DR

The paper compares human and deep neural network (DNN) object recognition under several image degradations. Humans prove more robust to contrast reduction, additive noise, and eidolon distortions, while DNNs can outperform humans on clean color images. The paper also provides a psychophysically controlled benchmark and analysis tools.

ABSTRACT

Human visual object recognition is typically rapid and seemingly effortless, as well as largely independent of viewpoint and object orientation. Until very recently, animate visual systems were the only ones capable of this remarkable computational feat. This has changed with the rise of a class of computer vision algorithms called deep neural networks (DNNs) that achieve human-level classification performance on object recognition tasks. Furthermore, a growing number of studies report similarities in the way DNNs and the human visual system process objects, suggesting that current DNNs may be good models of human visual object recognition. Yet there clearly exist important architectural and processing differences between state-of-the-art DNNs and the primate visual system. The potential behavioural consequences of these differences are not well understood. We aim to address this issue by comparing human and DNN generalisation abilities towards image degradations. We find the human visual system to be more robust to image manipulations like contrast reduction, additive noise or novel eidolon-distortions. In addition, we find progressively diverging classification error-patterns between humans and DNNs when the signal gets weaker, indicating that there may still be marked differences in the way humans and current DNNs perform visual object recognition. We envision that our findings as well as our carefully measured and freely available behavioural datasets provide a new useful benchmark for the computer vision community to improve the robustness of DNNs and a motivation for neuroscientists to search for mechanisms in the brain that could facilitate this robustness.

Motivation & Objective

Assess how human observers and three well-known DNNs (AlexNet, GoogLeNet, VGG-16) generalize to degraded images.
Quantify robustness differences under color, contrast, additive noise, and eidolon distortions using controlled psychophysical methods.
Provide a fine-grained, category-level comparison of error patterns between humans and DNNs.
Offer freely available datasets and analysis tools to benchmark and guide robustness improvements in DNNs.

Proposed method

Show each image to human observers for a brief, fixed duration (200 ms) followed by a backward mask, to minimize feedback processing.
Evaluate three DNNs (AlexNet, GoogLeNet, VGG-16) on the same degraded stimuli using a center-crop, 224×224 input pipeline in Caffe.
Manipulate images via grayscale vs. color conversion, contrast reduction, additive white noise, and eidolon distortions of controlled coherence.
Compute accuracy and the entropy of the response distribution across 16 categories to quantify response bias, i.e. how unevenly responses are spread over the categories.
Introduce confusion difference matrices to compare category-level error patterns between humans and each DNN.
Provide a paired analysis at matched performance levels to visualize divergence in error patterns under noise.
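The image manipulations and the center-crop input step listed above can be sketched in NumPy. This is a minimal illustration, not the paper's code: it assumes float images with values in [0, 1], BT.601 luma weights for grayscale conversion, a contrast formula that scales pixels toward mid-gray, uniform i.i.d. noise as the form of white noise, and 256×256 source images. All of these are assumptions, and the paper's exact parameterization may differ. Eidolon distortions are omitted because they need a dedicated multi-scale implementation.

```python
import numpy as np

# Assumed luma weights (ITU-R BT.601); the paper's grayscale conversion may differ.
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])


def to_grayscale(rgb: np.ndarray) -> np.ndarray:
    """Convert an H x W x 3 image in [0, 1] to a single-channel H x W image."""
    return rgb @ LUMA_WEIGHTS


def reduce_contrast(img: np.ndarray, contrast: float) -> np.ndarray:
    """Scale pixel values toward mid-gray (0.5); contrast=1.0 leaves img unchanged."""
    return contrast * img + (1.0 - contrast) * 0.5


def add_uniform_noise(img: np.ndarray, width: float, rng: np.random.Generator) -> np.ndarray:
    """Add i.i.d. uniform noise in [-width, width) per pixel, then clip to [0, 1]."""
    noise = rng.uniform(-width, width, size=img.shape)
    return np.clip(img + noise, 0.0, 1.0)


def center_crop(img: np.ndarray, size: int = 224) -> np.ndarray:
    """Cut a size x size patch from the image center (first two axes are H, W)."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stimulus = rng.random((256, 256, 3))  # stand-in for a 256 x 256 color image
    gray = to_grayscale(stimulus)
    low_contrast = reduce_contrast(gray, contrast=0.3)
    noisy = add_uniform_noise(low_contrast, width=0.2, rng=rng)
    dnn_input = center_crop(noisy)
    print(dnn_input.shape)  # (224, 224)
```

Reducing contrast before adding noise keeps most pixels away from 0 and 1, so less of the added noise is lost to clipping; whether the paper ordered the steps this way is not stated here.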

Experimental results

Research questions

RQ1: How do humans and standard DNNs differ in robustness to color, contrast, noise, and eidolon distortions during rapid object recognition?
RQ2: Do DNNs and humans exhibit similar or divergent category-level error patterns under degraded image conditions?
RQ3: To what extent do DNNs’ error patterns align with human error patterns when task difficulty is equated by matching accuracy levels?
RQ4: Can the resulting behavioral datasets serve as benchmarks to improve DNN robustness and inform neuroscience research on visual processing?
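RQ3 compares error patterns only where humans and a network are equally accurate. One hedged way to choose such matched-difficulty conditions, not necessarily the paper's procedure, is to interpolate each observer's accuracy-versus-noise curve and read off the noise level at which it reaches a shared target accuracy. The function name `level_at_accuracy` and the accuracy values below are hypothetical.

```python
import numpy as np


def level_at_accuracy(levels, accuracies, target):
    """Noise level at which a non-increasing accuracy curve reaches `target`.

    np.interp needs increasing x-values, so both arrays are reversed
    (accuracy falls as the noise level rises).
    """
    levels = np.asarray(levels, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)
    if np.any(np.diff(accuracies) > 0):
        raise ValueError("accuracy must not increase with noise level")
    return float(np.interp(target, accuracies[::-1], levels[::-1]))


# Hypothetical accuracy curves (fraction correct) over the same noise widths.
noise_widths = [0.0, 0.1, 0.2, 0.3]
human_acc = [0.90, 0.80, 0.60, 0.40]
dnn_acc = [0.95, 0.60, 0.30, 0.10]

target = 0.6  # shared accuracy at which error patterns would be compared
human_level = level_at_accuracy(noise_widths, human_acc, target)
dnn_level = level_at_accuracy(noise_widths, dnn_acc, target)
print(human_level, dnn_level)
```

With these made-up curves, the network reaches 60% accuracy at a smaller noise width (0.1) than the human observers (0.2), so a matched comparison would pair human responses at width 0.2 with network responses at width 0.1.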

Key findings

Humans are more robust than DNNs to contrast reduction and additive noise, maintaining higher accuracy as these degradations increase.
All three DNNs show strong biases toward a few categories under degraded conditions, unlike humans, who distribute their responses more evenly.
DNNs can outperform humans on non-degraded color images, where human viewing was restricted to brief, masked presentations that minimize feedback processing, but this advantage diminishes as degradation increases.
Confusion difference matrices reveal category-specific divergences in error patterns between humans and DNNs, particularly under higher task difficulty.
In the eidolon experiments (varying distortion strength and coherence), humans maintain higher accuracy than DNNs at intermediate distortion levels, while the networks converge to biased responses under strong distortions.
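The two analysis quantities behind the bias and error-pattern findings, response-distribution entropy and confusion difference matrices, can be sketched as follows. This is a minimal illustration under stated assumptions: responses are integer category indices from 0 to 15; entropy is measured in bits, so an observer using all 16 categories equally often reaches log2(16) = 4 bits; and each confusion matrix is row-normalized so that row i gives the response distribution for true category i. The function names and the demo data are hypothetical, and the paper's exact normalization may differ.

```python
import numpy as np

N_CATEGORIES = 16  # the 16 categories used in the experiments


def response_entropy(responses, n_categories=N_CATEGORIES):
    """Shannon entropy (bits) of the distribution of given responses.

    Maximum is log2(n_categories) when every category is chosen equally often;
    lower values indicate a bias toward a few categories.
    """
    counts = np.bincount(responses, minlength=n_categories)
    p = counts / counts.sum()
    p = p[p > 0]  # 0 * log(0) is treated as 0
    return float(-(p * np.log2(p)).sum())


def confusion_matrix(true_labels, responses, n_categories=N_CATEGORIES):
    """Row-normalized confusion matrix: entry [i, j] = P(response j | true class i)."""
    m = np.zeros((n_categories, n_categories))
    np.add.at(m, (true_labels, responses), 1)  # count each (true, response) pair
    row_sums = m.sum(axis=1, keepdims=True)
    return np.divide(m, row_sums, out=np.zeros_like(m), where=row_sums > 0)


def confusion_difference(true_labels, human_resp, dnn_resp, n_categories=N_CATEGORIES):
    """Human minus DNN confusion matrix; nonzero cells mark diverging error patterns."""
    return (confusion_matrix(true_labels, human_resp, n_categories)
            - confusion_matrix(true_labels, dnn_resp, n_categories))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    truth = rng.integers(0, N_CATEGORIES, size=1000)
    # Hypothetical observers: humans err randomly, the "DNN" collapses onto class 3.
    human = np.where(rng.random(1000) < 0.3, rng.integers(0, N_CATEGORIES, size=1000), truth)
    dnn = np.where(rng.random(1000) < 0.7, 3, truth)
    print(response_entropy(human), response_entropy(dnn))
    print(confusion_difference(truth, human, dnn).shape)  # (16, 16)
```

In this toy setup the biased "DNN" yields a lower response entropy than the "human" observer, and its column for class 3 shows up as large negative cells in the difference matrix.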


This review was created by AI and reviewed by human editors.