QUICK REVIEW

[Paper Review] Multiple Object Recognition with Visual Attention

Jimmy Ba, Volodymyr Mnih | arXiv (Cornell University) | Dec 24, 2014

Advanced Image and Video Retrieval Techniques | 6 references | 701 citations

TL;DR

This paper proposes a deep recurrent attention model (DRAM) that uses reinforcement learning to sequentially attend to relevant image regions for multi-object recognition. It outperforms state-of-the-art convolutional networks on SVHN house number recognition with fewer parameters and less computation, especially on larger, less-cropped images.

ABSTRACT

We present an attention-based model for recognizing multiple objects in images. The proposed model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image. We show that the model learns to both localize and recognize multiple objects despite being given only class labels during training. We evaluate the model on the challenging task of transcribing house number sequences from Google Street View images and show that it is both more accurate than the state-of-the-art convolutional networks and uses fewer parameters and less computation.

Motivation & Objective

Address the scalability and efficiency limitations of convolutional neural networks (ConvNets) when processing large images.
Enable end-to-end training of a model that jointly localizes and recognizes multiple objects using only class labels during training.
Develop a flexible, efficient architecture that scales to variable input sizes and handles variable-length object sequences.
Improve performance on real-world, less-ideal image data (e.g., larger, less tightly cropped images) compared to standard ConvNets.

Proposed method

Use a deep recurrent neural network that processes multi-resolution image crops, called glimpses, at each time step.
Train the model via reinforcement learning to maximize a variational lower bound on the log-likelihood of the label sequence.
Employ a glimpse network to extract features from attended image regions and a recurrent controller to decide the next glimpse location.
Use a policy network to output glimpse locations and optionally predict object classes, with the process continuing until no more objects are detected.
Apply stochasticity in the glimpse policy during training to improve generalization and reduce overfitting.
Fine-tune the model on larger images by reapplying it to cropped regions around previously attended locations, enabling adaptation without training from scratch.
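The multi-resolution glimpse described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the zero padding at image borders, the average-pooling downsampler, and the default sizes (`base_size=4`, `num_scales=2`) are all assumptions chosen for readability. It uses plain Python lists so that it has no dependencies. Each scale covers twice the width of the previous one and is reduced to the same output size, so the network sees a sharp centre and a coarse surround.

```python
def crop_with_padding(image, cy, cx, size):
    """Return a size x size crop around (cy, cx); pixels outside the image are 0."""
    height, width = len(image), len(image[0])
    top, left = cy - size // 2, cx - size // 2
    return [
        [
            image[r][c] if 0 <= r < height and 0 <= c < width else 0.0
            for c in range(left, left + size)
        ]
        for r in range(top, top + size)
    ]


def average_pool(patch, factor):
    """Downsample a square patch by averaging non-overlapping factor x factor blocks."""
    out_size = len(patch) // factor
    area = factor * factor
    return [
        [
            sum(
                patch[i * factor + di][j * factor + dj]
                for di in range(factor)
                for dj in range(factor)
            ) / area
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


def extract_glimpse(image, cy, cx, base_size=4, num_scales=2):
    """Return num_scales patches of base_size x base_size pixels each.

    Patch k covers a (base_size * 2**k)-wide window around (cy, cx) and is
    average-pooled back down to base_size x base_size.
    """
    patches = []
    for scale in range(num_scales):
        factor = 2 ** scale
        crop = crop_with_padding(image, cy, cx, base_size * factor)
        patches.append(average_pool(crop, factor))
    return patches


# Toy 8x8 "image" whose pixel value equals its row index.
image = [[float(r)] * 8 for r in range(8)]
glimpse = extract_glimpse(image, cy=4, cx=4)
```

In a full model, the patches in `glimpse` would be passed to the glimpse network together with the location `(cy, cx)`, and the recurrent controller would then propose the next location.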

Experimental results

Research questions

RQ1: Can an end-to-end trainable model learn to localize and recognize multiple objects in images using only class-level supervision?
RQ2: Does an attention-based approach outperform standard ConvNets in accuracy and efficiency, especially on large or poorly cropped images?
RQ3: Can a model trained on tightly cropped images generalize to larger, less-cropped inputs without retraining?
RQ4: How do the computational cost and parameter efficiency of the attention model compare to those of deep ConvNets across different image sizes?
RQ5: To what extent does the stochastic glimpse policy improve generalization and reduce overfitting compared to standard regularization in ConvNets?

Key findings

The DRAM model achieves state-of-the-art performance on the multi-digit SVHN recognition task, matching the best ConvNets on tightly cropped images and outperforming them on larger, less-cropped images.
On 54x54 cropped images, the DRAM model achieves a test error rate comparable to the best ConvNets but with significantly fewer parameters and lower computational cost.
On 110x110 enlarged images, the DRAM model outperforms the fine-tuned ConvNet by a large margin, demonstrating superior robustness to image scale and noise.
The DRAM model requires only a few hours to fine-tune on larger images, whereas the 10-layer ConvNet requires about a week to train from scratch.
The model's computational cost depends on the number and size of the glimpses it processes rather than on the input image size, making it highly efficient for large inputs.
The DRAM model is less prone to overfitting than ConvNets: dropout improves its performance by only a marginal 0.1%, whereas the ConvNet requires heavy dropout to reach a 5.5% error rate.
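The claim about computational cost can be made concrete with some back-of-the-envelope arithmetic. The sketch below counts only the input pixels fed to each kind of network, not the full number of operations, and the glimpse settings (8 glimpses, 12x12 patches, 2 scales) are hypothetical values chosen for illustration rather than figures from the paper.

```python
def glimpse_pixels(num_glimpses, patch_size, num_scales):
    """Input pixels seen by a model that only processes fixed-size glimpses."""
    return num_glimpses * num_scales * patch_size * patch_size


def full_image_pixels(height, width):
    """Input pixels seen by a network whose first layer covers the whole image."""
    return height * width


# Hypothetical glimpse settings, for illustration only.
settings = dict(num_glimpses=8, patch_size=12, num_scales=2)

for side in (54, 110):
    print(
        f"{side}x{side}: glimpses={glimpse_pixels(**settings)}, "
        f"full image={full_image_pixels(side, side)}"
    )
```

Under these assumptions the glimpse count stays the same at both image sizes, while the full-image count grows with the square of the side length. A real comparison would also need to account for the per-pixel cost of each architecture.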

This review was created by AI and reviewed by human editors.