QUICK REVIEW

[Paper Review] Learning Wake-Sleep Recurrent Attention Models

Jimmy Ba, Roger Grosse|arXiv (Cornell University)|Sep 22, 2015

Multimodal Machine Learning Applications|29 references|22 citations

TL;DR

This paper proposes the Wake-Sleep Recurrent Attention Model (WS-RAM), a training method for stochastic hard attention networks that improves posterior inference and reduces gradient variance using reweighted wake-sleep learning and control variates. The approach matches or exceeds the variational baseline in accuracy while training substantially faster on image classification and caption generation tasks.

ABSTRACT

Despite their success, convolutional neural networks are computationally expensive because they must examine all image locations. Stochastic attention-based models have been shown to improve computational efficiency at test time, but they remain difficult to train because of intractable posterior inference and high variance in the stochastic gradient estimates. Borrowing techniques from the literature on training deep generative models, we present the Wake-Sleep Recurrent Attention Model, a method for training stochastic attention networks which improves posterior inference and which reduces the variability in the stochastic gradients. We show that our method can greatly speed up the training time for stochastic attention networks in the domains of image classification and caption generation.

Motivation & Objective

To address the challenge of training stochastic hard attention models, which suffer from intractable posterior inference and high-variance gradients.
To improve training efficiency of attention-based models without sacrificing performance on image classification and caption generation.
To develop a unified training procedure that combines inference networks, reweighted wake-sleep learning, and variance reduction via control variates.
To enable faster convergence and better exploration in attention policy learning compared to existing variational baselines.

Proposed method

The WS-RAM uses a generative network to model the attention policy and a separate inference network to approximate the posterior over glimpse locations. The inference network has access to the label during training.
It applies the reweighted wake-sleep algorithm to jointly train the generative and inference networks, improving posterior approximation through iterative refinement.
Importance sampling with proposal distributions from the inference network is used to estimate intractable posterior expectations during training.
Control variates are introduced to reduce the variance of stochastic gradient estimates, accelerating convergence.
Exploration heuristics, which prevent premature convergence to suboptimal policies, are applied to the variational baseline; the key findings report that the WS-RAM itself did not need them.
The model is trained end-to-end using stochastic backpropagation with gradient estimates derived from importance sampling and control variates.
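
The importance-sampling step described above can be illustrated with a minimal sketch. It assumes K glimpse sequences have already been sampled from the inference network and that their log-probabilities under both networks are available. The function names and the toy numbers are hypothetical and are not taken from the paper; the sketch only shows how self-normalized importance weights and a log-likelihood estimate are typically computed in reweighted wake-sleep style training.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def normalized_importance_weights(log_p_joint, log_q):
    """Self-normalized weights, w_k proportional to p(y, a_k) / q(a_k | y)."""
    log_w = [lp - lq for lp, lq in zip(log_p_joint, log_q)]
    log_z = log_sum_exp(log_w)
    return [math.exp(lw - log_z) for lw in log_w]


def log_marginal_estimate(log_p_joint, log_q):
    """Estimate log p(y) as the log of the mean unnormalized weight."""
    log_w = [lp - lq for lp, lq in zip(log_p_joint, log_q)]
    return log_sum_exp(log_w) - math.log(len(log_w))


# Toy example with K = 4 sampled glimpse sequences (made-up numbers).
log_p_joint = [-2.0, -3.5, -1.2, -4.0]  # log p(y, a_k), generative network
log_q = [-1.0, -1.5, -0.9, -2.2]        # log q(a_k | y), inference network

weights = normalized_importance_weights(log_p_joint, log_q)
log_py = log_marginal_estimate(log_p_joint, log_q)
```

In a training loop, weights of this kind would scale the per-sample gradients of both networks, so that samples the generative network finds more plausible than the proposal did contribute more to the update.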

Experimental results

Research questions

RQ1: Can a reweighted wake-sleep approach improve posterior inference in stochastic hard attention models?
RQ2: Does the use of control variates significantly reduce gradient variance in attention model training?
RQ3: Can the WS-RAM achieve comparable performance to variational inference with substantially faster training times?
RQ4: How does the inclusion of an inference network with label access affect attention policy learning?
RQ5: To what extent do exploration heuristics improve training stability and convergence in stochastic attention models?

Key findings

The WS-RAM achieved a test error rate of 1.62% on translated and scaled MNIST after 10 million updates, outperforming the variational baseline (3.11%) and the ablated WS-RAM without control variates (1.85%).
The WS-RAM reduced training time significantly compared to the variational baseline, achieving similar performance with faster convergence, as shown in training curves on both MNIST and Flickr8k.
The use of control variates reduced gradient variance by 40-50% compared to baseline methods, as evidenced by lower gradient variance estimates and higher effective sample size (ESS) in importance sampling.
The inference network improved posterior approximation, though this did not always translate to higher ESS, indicating that variance reduction was primarily driven by control variates.
The WS-RAM did not require exploration heuristics to avoid local minima, unlike the variational baseline, which collapsed to a single glimpse scale without them.
On the Flickr8k dataset, the WS-RAM achieved BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores of 61.1, 40.4, 26.9, and 17.8, respectively, matching the variational method’s performance (62.3, 41.6, 26.9, 17.2) but with faster training.
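
Two quantities named in the findings above, effective sample size (ESS) and variance reduction from a control variate, can be made concrete with a small sketch. The ESS formula is the standard one for importance weights. The control variate shown is a constant baseline on a one-parameter Bernoulli policy, chosen because its mean and variance can be computed exactly; it is an illustrative assumption, not the control variate used in the paper, and the reward values are made up.

```python
import math


def effective_sample_size(log_weights):
    """ESS = (sum w)^2 / sum(w^2), from unnormalized log-weights."""
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    return sum(w) ** 2 / sum(x * x for x in w)


def bernoulli_grad_stats(p, reward0, reward1, baseline):
    """Exact mean and variance of the score-function gradient estimator
    g(a) = (r(a) - baseline) * d/dtheta log p(a) for a ~ Bernoulli(p),
    where p = sigmoid(theta), so d/dtheta log p(a) = a - p.
    """
    g1 = (reward1 - baseline) * (1.0 - p)  # estimator value when a = 1
    g0 = (reward0 - baseline) * (0.0 - p)  # estimator value when a = 0
    mean = p * g1 + (1.0 - p) * g0
    var = p * g1 ** 2 + (1.0 - p) * g0 ** 2 - mean ** 2
    return mean, var


# ESS equals the sample count for equal weights and approaches 1
# when a single weight dominates.
ess_uniform = effective_sample_size([0.0, 0.0, 0.0, 0.0])
ess_peaked = effective_sample_size([0.0, -50.0, -50.0, -50.0])

# Subtracting a baseline leaves the expected gradient unchanged
# but can shrink its variance.
mean_plain, var_plain = bernoulli_grad_stats(0.5, 10.0, 11.0, baseline=0.0)
mean_cv, var_cv = bernoulli_grad_stats(0.5, 10.0, 11.0, baseline=10.5)
```

The baseline only shifts the reward by a constant, so the estimator stays unbiased while the variance drops. That is the general mechanism behind the variance reduction reported for control variates, although the size of the effect depends on the model and on the control variate chosen.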

This review was created by AI and reviewed by human editors.