QUICK REVIEW

[Paper Review] Generalized Denoising Auto-Encoders as Generative Models

Yoshua Bengio, Li Yao|arXiv (Cornell University)|May 29, 2013
Neural Networks and Applications|20 references|210 citations
TL;DR

This paper proposes a generalized denoising auto-encoder (DAE) framework that treats the denoising process as a probabilistic estimator of the true data-generating distribution. Alternating between sampling from the DAE's conditional reconstruction distribution $P(X|\tilde{X})$ and from the corruption process ${\cal C}(\tilde{X}|X)$ defines a Markov chain that converges to the true data distribution, enabling generative sampling for both discrete and continuous data with arbitrary corruption and loss functions.

ABSTRACT

Recent work has shown how denoising and contractive autoencoders implicitly capture the structure of the data-generating density, in the case where the corruption noise is Gaussian, the reconstruction error is the squared error, and the data is continuous-valued. This has led to various proposals for sampling from this implicitly learned density function, using Langevin and Metropolis-Hastings MCMC. However, it remained unclear how to connect the training procedure of regularized auto-encoders to the implicit estimation of the underlying data-generating distribution when the data are discrete, or using other forms of corruption process and reconstruction errors. Another issue is the mathematical justification which is only valid in the limit of small corruption noise. We propose here a different attack on the problem, which deals with all these issues: arbitrary (but noisy enough) corruption, arbitrary reconstruction loss (seen as a log-likelihood), handling both discrete and continuous-valued variables, and removing the bias due to non-infinitesimal corruption noise (or non-infinitesimal contractive penalty).

Motivation & Objective

  • To provide a formal probabilistic interpretation of denoising autoencoders that extends beyond Gaussian noise and squared error to arbitrary data types and corruption processes.
  • To resolve the limitations of prior work that required infinitesimal noise and specific loss functions, by establishing a general theoretical foundation for DAEs as implicit density estimators.
  • To enable effective generative sampling from DAEs using a Markov chain that alternates between the denoising model $P(X|\tilde{X})$ and the corruption process ${\cal C}(\tilde{X}|X)$, even with non-infinitesimal noise.
  • To validate the method empirically on both non-parametric artificial data and real-world data (e.g., MNIST), demonstrating robustness and improved sample quality.
  • To introduce walkback training, a novel training procedure that enhances convergence and sample quality by using the model’s own reconstruction to define the corruption process.

Proposed method

  • The core method treats the DAE as a conditional model $P(X|\tilde{X})$, trained to reconstruct clean inputs $X$ from corrupted versions $\tilde{X}$, where the corruption process ${\cal C}(\tilde{X}|X)$ is arbitrary but has broad support.
  • A Markov chain is constructed by alternating: (1) sampling $X$ from $P(X|\tilde{X})$, the DAE’s reconstruction distribution, and (2) sampling $\tilde{X}$ from ${\cal C}(\tilde{X}|X)$, the corruption process, ensuring convergence to the true data distribution $P(X)$.
  • The method generalizes prior work by allowing any reconstruction loss to be interpreted as a log-likelihood, enabling use with discrete data (e.g., Bernoulli) and non-Gaussian corruption (e.g., salt-and-pepper noise).
  • Theoretical justification is provided by showing that the stationary distribution of the Markov chain converges to the true data distribution $P(X)$, provided the learned $P(X|\tilde{X})$ is a consistent estimator of the true conditional and the chain is ergodic.
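  • The walkback training procedure is introduced, in which corrupted training inputs are generated by running the model's own chain (alternating corruption and reconstruction) from a training example while the clean example remains the target. The procedure is analogous to contrastive divergence and improves training stability and sample quality.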
  • Empirical validation uses both non-parametric models (e.g., multinomial, Parzen estimators) and parametric DAEs with deep neural networks on MNIST and synthetic data.
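
The walkback idea in the list above can be sketched as a routine that lets the model wander away from a training example and always asks it to reconstruct the original. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `walkback_pairs`, `corrupt`, `sample_reconstruction`, `salt_and_pepper`, and `toy_reconstruction` are hypothetical, the fixed step count `n_steps` is chosen here purely for simplicity, and the "reconstruction" function is a hand-written stand-in for a trained DAE.

```python
import numpy as np

def walkback_pairs(x, corrupt, sample_reconstruction, n_steps, rng):
    """Return (corrupted_input, clean_target) pairs for one training example.

    corrupt(x, rng)                 -> sample from C(X_tilde | X)  (hypothetical callable)
    sample_reconstruction(xt, rng)  -> sample from P(X | X_tilde)  (hypothetical callable)
    n_steps                         -> number of walkback steps (an assumption of this sketch)
    """
    pairs = []
    x_star = x
    for _ in range(n_steps):
        x_tilde = corrupt(x_star, rng)          # corrupt the current chain state
        pairs.append((x_tilde, x))              # the target is always the clean example
        x_star = sample_reconstruction(x_tilde, rng)  # let the model move the chain
    return pairs

def salt_and_pepper(x, rng, rate=0.5):
    # Replace each bit with a random 0/1 value with probability `rate`.
    mask = rng.random(x.shape) < rate
    noise = rng.integers(0, 2, size=x.shape)
    return np.where(mask, noise, x)

def toy_reconstruction(x_tilde, rng):
    # Stand-in for a trained DAE: each output bit copies the input bit
    # with probability 0.9 and flips it otherwise.
    probs = 0.1 + 0.8 * x_tilde
    return (rng.random(x_tilde.shape) < probs).astype(x_tilde.dtype)

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=16)
pairs = walkback_pairs(x, salt_and_pepper, toy_reconstruction, n_steps=5, rng=rng)
```

Each pair would be fed to the usual reconstruction loss, so the model is penalized for any place its own chain drifts to that does not lead back to the training example.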

Experimental results

Research questions

  • RQ1: Can denoising autoencoders be formally interpreted as implicit density estimators for arbitrary data types and corruption processes?
  • RQ2: Does a Markov chain alternating between $P(X|\tilde{X})$ and ${\cal C}(\tilde{X}|X)$ converge to the true data-generating distribution $P(X)$, even with non-infinitesimal noise?
  • RQ3: Can the framework support arbitrary reconstruction losses, including cross-entropy for discrete data, and still yield valid generative sampling?
  • RQ4: How does the proposed walkback training procedure compare to standard DAE training in terms of convergence speed and sample quality?
  • RQ5: Can the method generate high-quality samples on real-world data such as MNIST, and how does it compare to state-of-the-art models like RBMs?
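
RQ2 can be made concrete by writing the chain out explicitly. The following is a sketch in the notation used above; the symbols $P_\theta$ (the trained model), $T$ (the one-step transition operator), and $P(\tilde{X})$ (the marginal over corrupted inputs) are shorthand introduced here. It covers only the idealized case in which the model equals the true conditional.

```latex
% Training criterion: reconstruction loss read as a conditional log-likelihood
\theta^\ast = \arg\max_\theta \;
  \mathbb{E}_{X \sim P(X),\; \tilde{X} \sim \mathcal{C}(\tilde{X}|X)}
  \big[ \log P_\theta(X|\tilde{X}) \big]

% One step of the chain: corrupt the previous state, then reconstruct
T(X_t | X_{t-1}) = \int P_\theta(X_t|\tilde{X}) \, \mathcal{C}(\tilde{X}|X_{t-1}) \, d\tilde{X}

% If P_\theta(X|\tilde{X}) = P(X|\tilde{X}), then P(X) is stationary under T
\int T(X|X') \, P(X') \, dX'
  = \int P(X|\tilde{X}) \Big[ \int \mathcal{C}(\tilde{X}|X') \, P(X') \, dX' \Big] d\tilde{X}
  = \int P(X|\tilde{X}) \, P(\tilde{X}) \, d\tilde{X}
  = P(X)
```

For discrete data the integrals become sums. Stationarity alone does not guarantee convergence from an arbitrary starting point; that additionally requires the chain to be ergodic.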

Key findings

  • The Markov chain that alternates between the DAE’s reconstruction $P(X|\tilde{X})$ and the corruption process ${\cal C}(\tilde{X}|X)$ converges to the true data distribution $P(X)$, providing a sound theoretical basis for generative sampling.
  • On binarized MNIST with salt-and-pepper noise (50% corruption), the DAE achieved a non-parametric log-likelihood bound of -116 with walkback training and -142 without, outperforming a baseline RBM’s bound of -233 before blurring.
  • After applying a spatial blur (Gaussian convolution) to RBM samples, the log-likelihood bound improved to -112, but no such improvement was observed for the DAE, suggesting the DAE samples were already high-quality.
  • Walkback training produced visibly fewer spurious samples than standard DAE training, as confirmed by visual inspection and by the quantitative log-likelihood bounds.
  • The method successfully recovered the true data distribution on synthetic data with 10 discrete values and 10-dimensional continuous data, validating the non-parametric setup.
  • The theoretical framework generalizes prior results by removing both the need for infinitesimal noise and the restriction to Gaussian corruption and squared error, enabling broader applicability to real-world data.
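
The discrete synthetic setting mentioned in the findings above can be illustrated with a small numerical check. This is a sketch under stated assumptions rather than a reproduction of the paper's experiment: the distribution `p_data`, the corruption matrix `C`, and all variable names are illustrative, and the "denoiser" `R` is the exact Bayes posterior instead of a trained model, which corresponds to the idealized case of a perfectly consistent estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10  # number of discrete values

# Hypothetical "true" distribution over K values.
p_data = rng.dirichlet(np.ones(K))

# Corruption C[x, xt] = C(X_tilde = xt | X = x):
# keep x with probability 0.6, otherwise draw uniformly from all K values.
C = 0.6 * np.eye(K) + 0.4 * np.full((K, K), 1.0 / K)

# Ideal denoiser via Bayes' rule: R[xt, x] = P(X = x | X_tilde = xt).
joint = p_data[:, None] * C          # joint[x, xt] = P(x) * C(xt | x)
R = (joint / joint.sum(axis=0)).T    # normalize over x for each xt

# One full chain step X -> X_tilde -> X'.
T = C @ R                            # T[x, x'] = sum_xt C(xt | x) * P(x' | xt)

# Stationary distribution by power iteration from a uniform start.
pi = np.full(K, 1.0 / K)
for _ in range(2000):
    pi = pi @ T

# Empirical check by simulating the alternating chain.
x = 0
counts = np.zeros(K)
for _ in range(20000):
    xt = rng.choice(K, p=C[x])       # sample X_tilde ~ C(. | X)
    x = rng.choice(K, p=R[xt])       # sample X ~ P(. | X_tilde)
    counts[x] += 1
empirical = counts / counts.sum()

max_gap_exact = np.abs(pi - p_data).max()
max_gap_empirical = np.abs(empirical - p_data).max()
```

With a learned rather than exact conditional, the gap between the chain's stationary distribution and `p_data` would depend on how well the model approximates `R`.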


This review was created by AI and reviewed by human editors.