QUICK REVIEW

[Paper Review] Generative Image Modeling Using Spatial LSTMs

Lucas Theis, Matthias Bethge|arXiv (Cornell University)|Jun 10, 2015

Generative Adversarial Networks and Image Synthesis|51 references|103 citations

TL;DR

This paper proposes RIDE, a deep generative image model that uses spatial long short-term memory (LSTM) units to capture long-range spatial dependencies in images. By combining multi-dimensional LSTMs with factorized mixtures of conditional Gaussian scale mixtures (MCGSMs), RIDE keeps the likelihood tractable, outperforms the state of the art in quantitative comparisons on several image datasets, and produces promising results for texture synthesis and inpainting, particularly on data with strong long-range correlations.

ABSTRACT

Modeling the distribution of natural images is challenging, partly because of strong statistical dependencies which can extend over hundreds of pixels. Recurrent neural networks have been successful in capturing long-range dependencies in a number of problems but only recently have found their way into generative image models. We here introduce a recurrent image model based on multi-dimensional long short-term memory units which are particularly suited for image modeling due to their spatial structure. Our model scales to images of arbitrary size and its likelihood is computationally tractable. We find that it outperforms the state of the art in quantitative comparisons on several image datasets and produces promising results when used for texture synthesis and inpainting.

Motivation & Objective

To develop a deep, tractable generative model for natural images that captures long-range spatial dependencies.
To improve upon existing generative models by integrating multi-dimensional LSTMs into a recurrent image modeling framework.
To enable scalable image modeling for arbitrarily sized images while maintaining computational tractability of likelihoods.
To demonstrate the model's effectiveness on texture synthesis and image inpainting, where long-range correlations are critical.
To introduce a factorized MCGSM variant that enhances representational capacity without excessive parameter growth.

Proposed method

The model uses a spatial LSTM architecture that processes pixels in a raster-scan order, allowing recurrent connections to propagate information across large spatial regions.
Each pixel's conditional distribution is modeled via a factorized MCGSM, where parameters are shared across spatial locations but conditioned on local context through the LSTM hidden states.
The joint likelihood is computed via the chain rule: p(x;θ) = ∏_{i,j} p(x_ij | x_<ij; θ), where x_<ij denotes all pixels before (i,j) in scan order.
The MCGSM component models each pixel's conditional density as a mixture of conditional Gaussian scale mixtures: a context-dependent gate weights Gaussian components that differ in scale, which gives a flexible, heavy-tailed distribution over pixel intensities.
For posterior inference in inpainting, the model employs a Metropolis-within-Gibbs MCMC scheme with ancestral sampling initialization and local proposal updates.
The model is trained end-to-end using maximum likelihood estimation, with likelihoods computed efficiently via the spatial LSTM's autoregressive structure.
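The chain-rule factorization above can be illustrated with a minimal sketch. It is not the authors' implementation: the spatial LSTM and MCGSM are replaced by a toy conditional (a Gaussian whose mean is a weighted sum of the left and upper neighbours), and the names `causal_context`, `log_likelihood`, `w_left`, `w_up`, and `var` are assumptions made for illustration. What it shows is only the structure of the computation: one conditional log-density per pixel, visited in raster-scan order and summed.

```python
import math


def causal_context(img, i, j):
    """Return the left and upper neighbours of pixel (i, j), zero-padded.

    Both neighbours precede (i, j) in raster-scan order, so the context
    only uses pixels from x_<ij.
    """
    left = img[i][j - 1] if j > 0 else 0.0
    up = img[i - 1][j] if i > 0 else 0.0
    return left, up


def gaussian_logpdf(x, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_likelihood(img, w_left=0.5, w_up=0.5, var=1.0):
    """Sum log p(x_ij | x_<ij) over all pixels in raster-scan order."""
    total = 0.0
    for i in range(len(img)):
        for j in range(len(img[0])):
            left, up = causal_context(img, i, j)
            # Toy stand-in (assumption) for the LSTM + MCGSM conditional.
            mean = w_left * left + w_up * up
            total += gaussian_logpdf(img[i][j], mean, var)
    return total


if __name__ == "__main__":
    image = [[0.0, 1.0], [1.0, 2.0]]
    print(log_likelihood(image))
```

Because the same conditional is applied at every position, the function accepts images of any height and width, which mirrors the review's point about scaling to arbitrarily sized images.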

Experimental results

Research questions

RQ1: Can a multi-dimensional LSTM architecture effectively model long-range spatial dependencies in natural images?
RQ2: Does combining spatial LSTMs with a factorized MCGSM improve generative modeling performance compared to prior autoregressive models?
RQ3: Can the model generate realistic textures and perform effective image inpainting by capturing complex statistical patterns?
RQ4: How does the model scale to arbitrarily large images while maintaining tractable likelihood computation?
RQ5: To what extent do spatial LSTMs outperform standard convolutional or autoregressive models on image generation tasks?

Key findings

RIDE outperforms the state of the art in quantitative log-likelihood comparisons on several image datasets.
The model achieves superior performance on texture synthesis, particularly on textures with bimodal distributions and periodic patterns such as D104 and D34.
For image inpainting, RIDE successfully reconstructs large missing regions (71×71 pixels) using MCMC sampling, producing visually plausible results.
The factorized MCGSM component significantly improves modeling capacity with minimal parameter increase, enabling better representation of complex image statistics.
RIDE demonstrates strong generalization on unseen textures, generating samples nearly indistinguishable from real ones on D106 and D110.
The use of spatial LSTMs enables the model to capture long-range correlations that standard MCGSMs or local models fail to model effectively.
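The inpainting result above depends on sampling the missing pixels conditioned on the observed ones. The following is a minimal sketch of Metropolis-within-Gibbs on a toy problem, not the paper's procedure: the density is a simple autoregressive Gaussian standing in for RIDE, the proposal is a symmetric Gaussian random walk (an assumption), and `log_joint`, `inpaint`, `step`, and `n_sweeps` are hypothetical names. Missing pixels are first filled by ancestral sampling in raster order, then updated one at a time with an accept/reject step.

```python
import math
import random


def log_joint(img, w=0.5, var=1.0):
    """Toy autoregressive log-density: each pixel is Gaussian around a
    weighted sum of its left and upper neighbours (zero-padded)."""
    total = 0.0
    for i in range(len(img)):
        for j in range(len(img[0])):
            left = img[i][j - 1] if j > 0 else 0.0
            up = img[i - 1][j] if i > 0 else 0.0
            mean = w * (left + up)
            total += -0.5 * (math.log(2.0 * math.pi * var)
                             + (img[i][j] - mean) ** 2 / var)
    return total


def inpaint(img, missing, n_sweeps=200, step=0.5, w=0.5, var=1.0, seed=0):
    """Sample the pixels listed in `missing` given all other pixels."""
    rng = random.Random(seed)
    x = [row[:] for row in img]  # copy, so the input is left untouched

    # Ancestral initialisation: fill holes in raster-scan order, each from
    # the toy conditional given the pixels that precede it.
    for i, j in sorted(missing):
        left = x[i][j - 1] if j > 0 else 0.0
        up = x[i - 1][j] if i > 0 else 0.0
        x[i][j] = rng.gauss(w * (left + up), math.sqrt(var))

    current = log_joint(x, w, var)
    for _ in range(n_sweeps):
        for i, j in missing:
            old = x[i][j]
            x[i][j] = old + rng.gauss(0.0, step)  # symmetric local proposal
            proposed = log_joint(x, w, var)
            # Metropolis acceptance; the proposal terms cancel by symmetry.
            if rng.random() < math.exp(min(0.0, proposed - current)):
                current = proposed
            else:
                x[i][j] = old  # reject: restore the previous value
    return x


if __name__ == "__main__":
    image = [[1.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
    print(inpaint(image, [(1, 1)])[1][1])
```

Recomputing the full joint for every proposal keeps the sketch short but is wasteful; in an autoregressive model only the conditionals that depend on the changed pixel need to be re-evaluated.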

This review was created by AI and reviewed by human editors.