QUICK REVIEW

[Paper Review] Conditional Image Generation with PixelCNN Decoders

Aäron van den Oord, Nal Kalchbrenner | arXiv (Cornell University) | Jun 16, 2016
Generative Adversarial Networks and Image Synthesis | References: 32 | Cited by: 799
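One-Sentence Summary

This paper introduces Gated PixelCNN and Conditional PixelCNN for modeling and generating images conditioned on labels or embeddings. The models train faster than PixelRNN, match state-of-the-art likelihoods, and support both conditional image synthesis and decoding in autoencoders.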

ABSTRACT

This work explores conditional image generation with a new image density model based on the PixelCNN architecture. The model can be conditioned on any vector, including descriptive labels or tags, or latent embeddings created by other networks. When conditioned on class labels from the ImageNet database, the model is able to generate diverse, realistic scenes representing distinct animals, objects, landscapes and structures. When conditioned on an embedding produced by a convolutional network given a single image of an unseen face, it generates a variety of new portraits of the same person with different facial expressions, poses and lighting conditions. We also show that conditional PixelCNN can serve as a powerful decoder in an image autoencoder. Additionally, the gated convolutional layers in the proposed model improve the log-likelihood of PixelCNN to match the state-of-the-art performance of PixelRNN on ImageNet, with greatly reduced computational cost.

Motivation and Objectives

  • Motivate conditional image modeling for tasks like denoising, inpainting, and conditional generation of diverse scenes.
  • Develop a faster, effective autoregressive image model by upgrading PixelCNN to a gated variant and removing its receptive-field blind spot.
  • Demonstrate conditioning on class labels and embeddings to enable diverse, realistic samples across multiple datasets.
  • Explore using Conditional PixelCNN as an image decoder in autoencoders to learn high-level representations.

Proposed Method

  • Introduce Gated PixelCNN with a gating mechanism to replace standard activations.
  • Combine two convolutional stacks (horizontal and vertical) to eliminate receptive-field blind spots.
  • Develop Conditional PixelCNN by injecting conditioning vectors into layer activations (and optionally spatial maps) to model p(x|h).
  • Formulate a PixelCNN auto-encoder by replacing the decoder with a Conditional PixelCNN and training end-to-end.
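
The gating and conditioning steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the gated unit has the form tanh(W_f * x + V_f^T h) ⊙ σ(W_g * x + V_g^T h), covers only the location-independent case where the conditioning vector h adds one bias per feature map, and takes the two convolution outputs as precomputed nested lists. The names `gated_activation`, `project`, `conditional_gated_layer`, `V_f`, and `V_g` are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function used for the gate."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_activation(feature_pre, gate_pre, cond_feature=0.0, cond_gate=0.0):
    """Gated unit for a single value: tanh(feature) * sigmoid(gate).

    The conditioning terms are added to the pre-activations before the
    nonlinearities; they default to 0.0 for the unconditional model.
    """
    return math.tanh(feature_pre + cond_feature) * sigmoid(gate_pre + cond_gate)


def project(matrix, h):
    """Multiply a (channels x len(h)) matrix by h: one bias per feature map."""
    return [sum(w * x for w, x in zip(row, h)) for row in matrix]


def conditional_gated_layer(conv_f, conv_g, V_f, V_g, h):
    """Apply the gated unit to precomputed convolution outputs.

    conv_f, conv_g: lists of feature maps, indexed [channel][row][col].
    V_f, V_g:       hypothetical conditioning matrices (channels x len(h)).
    h:              conditioning vector, e.g. a one-hot class label.
    """
    bias_f = project(V_f, h)
    bias_g = project(V_g, h)
    return [
        [
            [gated_activation(f, g, bias_f[c], bias_g[c])
             for f, g in zip(row_f, row_g)]
            for row_f, row_g in zip(map_f, map_g)
        ]
        for c, (map_f, map_g) in enumerate(zip(conv_f, conv_g))
    ]


# Toy usage: one channel, one pixel, two-class one-hot label.
out = conditional_gated_layer(
    conv_f=[[[0.5]]], conv_g=[[[0.0]]],
    V_f=[[1.0, 0.0]], V_g=[[0.0, 2.0]],
    h=[1.0, 0.0],
)
```

Because the same bias is added at every spatial position, the label in this sketch influences what is drawn but not where. Conditioning on a spatial map, which the method list mentions as optional, would need a position-dependent term instead.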

Experimental Results

Research Questions

  • RQ1: Can a gated, autoregressive CNN match PixelRNN performance while reducing training time?
  • RQ2: Does conditioning PixelCNN on class labels or embeddings produce diverse, high-quality samples across tasks?
  • RQ3: Can Conditional PixelCNN serve effectively as a decoder in autoencoders to learn different latent representations?
  • RQ4: How does conditioning influence log-likelihood and visual diversity on CIFAR-10 and ImageNet-scale datasets?

Key Findings

Test (train) negative log-likelihood (NLL) in bits/dim; lower is better.

CIFAR-10
Model                          NLL Test (Train)
Uniform Distribution [30]      8.00
Multivariate Gaussian [30]     4.70
NICE [4]                       4.48
Deep Diffusion [24]            4.20
DRAW [9]                       4.13
Deep GMMs [31, 29]             4.00
Conv DRAW [8]                  3.58 (3.57)
RIDE [26, 30]                  3.47
PixelCNN [30]                  3.14 (3.08)
PixelRNN [30]                  3.00 (2.93)
Gated PixelCNN                 3.03 (2.90)

ImageNet 32x32
Model                          NLL Test (Train)
Conv DRAW [8]                  4.40 (4.35)
PixelRNN [30]                  3.86 (3.83)
Gated PixelCNN                 3.83 (3.77)

ImageNet 64x64
Model                          NLL Test (Train)
Conv DRAW [8]                  4.10 (4.04)
PixelRNN [30]                  3.63 (3.57)
Gated PixelCNN                 3.57 (3.48)

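
To make the NLL figures easier to read, here is a small sketch of how a summed negative log-likelihood is converted to bits per dimension. It assumes the table reports bits/dim over 8-bit sub-pixels, which is consistent with the uniform-distribution baseline of 8.00, since log2(256) = 8. The helper names `bits_per_dim` and `uniform_nll_nats` are hypothetical.

```python
import math


def bits_per_dim(total_nll_nats, num_dims):
    """Convert an NLL summed over all dimensions (in nats) to bits per dimension."""
    return total_nll_nats / (num_dims * math.log(2))


def uniform_nll_nats(num_dims, levels=256):
    """Summed NLL in nats when each of `levels` sub-pixel values is equally likely."""
    return num_dims * math.log(levels)


dims = 32 * 32 * 3  # a 32x32 RGB image has 3072 sub-pixel dimensions
uniform_bpd = bits_per_dim(uniform_nll_nats(dims), dims)
print(round(uniform_bpd, 2))  # 8.0
```

Read this way, each row is the average number of bits the model needs per colour value, so small gaps add up: a difference of 0.03 bits/dim on a 32x32 RGB image is roughly 92 bits per image.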
  • Gated PixelCNN achieves comparable log-likelihood to PixelRNN on CIFAR-10 and ImageNet while using less than half the training time.
  • Class-conditioned sampling yields clearly distinct and diverse samples across 8 classes, with variations in pose and background.
  • Sampling conditioned on portrait embeddings generates new portraits of the same person with varied expressions, poses, and lighting; interpolating in embedding space yields smooth transitions.
  • Reconstructions from the PixelCNN auto-encoder differ qualitatively from those of a standard convolutional auto-encoder, suggesting the encoder captures higher-level structure when paired with a probabilistic PixelCNN decoder.
  • On ImageNet, Gated PixelCNN reaches a lower test negative log-likelihood than PixelRNN at both 32x32 (3.83 vs. 3.86) and 64x64 (3.57 vs. 3.63) while remaining cheaper to train.
  • The two-stack (horizontal and vertical) architecture removes the blind spot in receptive fields, enabling better modeling of pixel dependencies.
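
The blind spot can be made concrete by tracking which input offsets can influence an output pixel after several stacked layers. The sketch below is a simplified dependency calculation under stated assumptions, not the paper's implementation: it assumes 3x3 kernels, a single-stack model whose first layer excludes the centre pixel (`MASK_A`) and whose later layers include it (`MASK_B`), and a two-stack model in which the vertical stack only ever sees rows above and the horizontal stack only sees pixels to the left in the current row. All names are hypothetical.

```python
def receptive_field(first_offsets, later_offsets, num_layers):
    """Set of (dy, dx) input offsets that can reach the output pixel.

    One layer uses `first_offsets` (the layer next to the input) and the
    remaining `num_layers - 1` layers use `later_offsets`.
    """
    reach = {(0, 0)}
    for layer in range(num_layers):
        offsets = first_offsets if layer == 0 else later_offsets
        reach = {(y + dy, x + dx) for (y, x) in reach for (dy, dx) in offsets}
    return reach


def missing_above(reach, depth):
    """Offsets in the rows above (within the kernel's reach) that are not covered."""
    return [(dy, dx)
            for dy in range(-depth, 0)
            for dx in range(-depth, depth + 1)
            if (dy, dx) not in reach]


# Single masked stack: the row above plus the pixel to the left.
MASK_A = [(-1, -1), (-1, 0), (-1, 1), (0, -1)]   # first layer, centre excluded
MASK_B = MASK_A + [(0, 0)]                       # later layers, centre included

# Two stacks: vertical sees rows above, horizontal sees the current row so far.
V_FIRST = [(-1, -1), (-1, 0), (-1, 1)]
V_LATER = V_FIRST + [(0, -1), (0, 0), (0, 1)]
H_FIRST = [(0, -1)]
H_LATER = [(0, -1), (0, 0)]

DEPTH = 3
single = receptive_field(MASK_A, MASK_B, DEPTH)
two_stack = (receptive_field(V_FIRST, V_LATER, DEPTH)
             | receptive_field(H_FIRST, H_LATER, DEPTH))

print(missing_above(single, DEPTH))     # [(-2, 3), (-1, 2), (-1, 3)]
print(missing_above(two_stack, DEPTH))  # []
```

In the single-stack case, pixels up and to the right such as offset (-1, 2) come earlier in raster order yet never reach the output, and adding layers only moves the gap further out. In the two-stack sketch the vertical stack's receptive field grows as a full rectangle over the rows above, so those offsets are covered.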


This review was generated by AI and reviewed by human editors.