[Paper Review] PixelSNAIL: An Improved Autoregressive Generative Model
PixelSNAIL combines causal convolutions with self-attention to achieve state-of-the-art density estimation on CIFAR-10 and ImageNet 32×32.
Autoregressive generative models consistently achieve the best results in density estimation tasks involving high dimensional data, such as images or audio. They pose density estimation as a sequence modeling task, where a recurrent neural network (RNN) models the conditional distribution over the next element conditioned on all previous elements. In this paradigm, the bottleneck is the extent to which the RNN can model long-range dependencies, and the most successful approaches rely on causal convolutions, which offer better access to earlier parts of the sequence than conventional RNNs. Taking inspiration from recent work in meta reinforcement learning, where dealing with long-range dependencies is also essential, we introduce a new generative model architecture that combines causal convolutions with self attention. In this note, we describe the resulting model and present state-of-the-art log-likelihood results on CIFAR-10 (2.85 bits per dim) and $32 \times 32$ ImageNet (3.80 bits per dim). Our implementation is available at https://github.com/neocxi/pixelsnail-public
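For reference, the sequence-modeling view in the abstract corresponds to the standard chain-rule factorization of a joint density. The notation here ($x_i$ for the $i$-th element in a fixed ordering, $n$ for the number of elements) is assumed for illustration and is not taken from the paper:

$$p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})$$

Each factor conditions only on earlier elements, which is why every layer of such a model must be causal with respect to the chosen ordering.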
Motivation & Objective
- Motivate improved modeling of long-range dependencies in autoregressive density estimation for high-dimensional data.
- Introduce an architecture that integrates causal convolutions with self-attention to better capture context.
- Show state-of-the-art log-likelihood results on standard benchmarks (CIFAR-10 and ImageNet 32×32).
- Provide an open-source implementation for reproducibility and further research in autoregressive modeling.
Proposed method
- Propose the PixelSNAIL architecture, which interleaves residual blocks of masked 2D causal convolutions with self-attention blocks.
- Use gated activations in residual blocks with 4 convolutions per block and 256 filters per convolution.
- In attention blocks, perform a single masked key-value lookup with keys of size 16 and values of size 128.
- Train with a discretized mixture-of-logistics output distribution (10 components for CIFAR-10, 32 for ImageNet) and Polyak averaging for parameter stabilization.
- Apply dropout in CIFAR-10 model and omit dropout for ImageNet due to dataset size; implement 1×1 convolutions for projections in attention blocks.
- Provide public code implementing PixelSNAIL at the given repository.
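The masked key-value lookup in the attention blocks can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation. The function name `masked_attention`, the scaling by the square root of the key size, and the choice to let each position attend only to strictly earlier positions are all assumptions. The random matrices stand in for the 1×1 convolution projections, since a 1×1 convolution is a per-position linear map.

```python
import numpy as np

def masked_attention(queries, keys, values):
    """Single masked key-value lookup over a flattened pixel sequence.

    queries, keys: arrays of shape (T, d_k); values: shape (T, d_v).
    Assumption: position t attends only to positions s < t in raster order,
    so the output at t never depends on the pixel being predicted.
    """
    T, d_k = keys.shape
    # Scaled dot-product logits (the scaling is an assumption of this sketch).
    logits = queries @ keys.T / np.sqrt(d_k)           # shape (T, T)
    # mask[t, s] is True exactly when s < t.
    mask = np.tril(np.ones((T, T), dtype=bool), k=-1)
    masked = np.where(mask, logits, -1e30)
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked) * mask                     # zero out disallowed positions
    denom = weights.sum(axis=-1, keepdims=True)
    # The first position has no predecessors; its weights stay all zero.
    weights = weights / np.maximum(denom, 1e-12)
    return weights @ values                             # shape (T, d_v)

# Hypothetical usage on an 8x8 feature map with 256 channels.
rng = np.random.default_rng(0)
T, channels, key_size, value_size = 8 * 8, 256, 16, 128
features = rng.normal(size=(T, channels))
w_q = rng.normal(size=(channels, key_size)) / np.sqrt(channels)
w_k = rng.normal(size=(channels, key_size)) / np.sqrt(channels)
w_v = rng.normal(size=(channels, value_size)) / np.sqrt(channels)
q, k, v = features @ w_q, features @ w_k, features @ w_v
out = masked_attention(q, k, v)
```

Because the mask is applied before normalization, perturbing keys or values at position 5 or later leaves the outputs at positions 0 through 5 unchanged, which is the property the autoregressive factorization requires.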
Experimental results
Research questions
- RQ1: Does combining causal convolutions with self-attention improve density estimation for autoregressive image models?
- RQ2: How does PixelSNAIL perform on standard benchmarks (CIFAR-10 and ImageNet 32×32) compared to prior autoregressive models?
- RQ3: What are the effects of architectural choices (block depth, attention settings, mixture components) on log-likelihood performance?
Key findings
- PixelSNAIL achieves 2.85 bits per dim on CIFAR-10 and 3.80 on ImageNet 32×32, outperforming prior autoregressive models.
- Compared to PixelRNN, PixelCNN, PixelCNN++, and Image Transformer, PixelSNAIL with integrated causal convolutions and attention yields the best log-likelihood results.
- Ablation-style results suggest that both causal convolutions and self-attention contribute to performance improvements over models using only one of the components.
- Code for the model is publicly available for reproducibility.
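To make the bits-per-dim figures concrete, the sketch below converts a per-image negative log-likelihood in nats into bits per dimension. The helper name `bits_per_dim` is hypothetical, and the example likelihood is constructed to match the reported 2.85 rather than taken from any measurement.

```python
import math

def bits_per_dim(nll_nats, num_dims):
    """Convert a per-image negative log-likelihood (nats) to bits per dimension."""
    return nll_nats / (num_dims * math.log(2))

# A 32x32 RGB image has 32 * 32 * 3 = 3072 dimensions (sub-pixels).
num_dims = 32 * 32 * 3

# Hypothetical per-image NLL, chosen so that it corresponds to 2.85 bits per dim.
nll_nats = 2.85 * num_dims * math.log(2)

print(round(bits_per_dim(nll_nats, num_dims), 2))
```

Read as an average code length, 2.85 bits per dim corresponds to roughly 8755 bits for a whole CIFAR-10 image, compared with 24576 bits for the raw 8-bit encoding.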
This review was created by AI and reviewed by human editors.