QUICK REVIEW

[Paper Review] Conditional Image Generation with PixelCNN Decoders

Aäron van den Oord, Nal Kalchbrenner|arXiv (Cornell University)|Jun 16, 2016
Generative Adversarial Networks and Image Synthesis | References: 32 | Citations: 799
One-line summary

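This paper proposes Gated PixelCNN and Conditional PixelCNN, which model and generate images conditioned on labels or embeddings. The models reach state-of-the-art log-likelihood with faster training than PixelRNN, and support conditional image synthesis and decoding in an autoencoder.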

ABSTRACT

This work explores conditional image generation with a new image density model based on the PixelCNN architecture. The model can be conditioned on any vector, including descriptive labels or tags, or latent embeddings created by other networks. When conditioned on class labels from the ImageNet database, the model is able to generate diverse, realistic scenes representing distinct animals, objects, landscapes and structures. When conditioned on an embedding produced by a convolutional network given a single image of an unseen face, it generates a variety of new portraits of the same person with different facial expressions, poses and lighting conditions. We also show that conditional PixelCNN can serve as a powerful decoder in an image autoencoder. Additionally, the gated convolutional layers in the proposed model improve the log-likelihood of PixelCNN to match the state-of-the-art performance of PixelRNN on ImageNet, with greatly reduced computational cost.

Motivation and objectives

  • Image density models support tasks such as denoising, inpainting, and conditional generation of diverse scenes.
  • Develop a faster, effective autoregressive image model by upgrading PixelCNN to a gated variant and addressing its receptive-field blind spot.
  • Demonstrate conditioning on class labels and embeddings to enable diverse, realistic samples across multiple datasets.
  • Explore using Conditional PixelCNN as an image decoder in autoencoders to learn high-level representations.

Proposed method

  • Introduce Gated PixelCNN, which replaces the ReLU activations between masked convolutions with a multiplicative gated unit (a tanh branch multiplied by a sigmoid gate).
  • Combine two convolutional stacks (horizontal and vertical) to eliminate receptive-field blind spots.
  • Develop Conditional PixelCNN by injecting conditioning vectors into layer activations (and optionally spatial maps) to model p(x|h).
  • Formulate a PixelCNN auto-encoder by replacing the decoder with a Conditional PixelCNN and training end-to-end.
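
The gating and conditioning steps above can be sketched in a few lines. The paper's conditional gated unit is y = tanh(W_f * x + V_f^T h) ⊙ σ(W_g * x + V_g^T h), where * is a masked convolution and h is the conditioning vector. The NumPy sketch below assumes the two masked-convolution outputs are already computed; the function name, argument names, and array shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def gated_activation(conv_f, conv_g, h=None, V_f=None, V_g=None):
    """Gated activation unit, optionally conditioned on a vector h.

    conv_f, conv_g: arrays of shape (H, W, C), assumed to be the outputs of
                    the masked convolutions for the tanh and gate branches.
    h:              conditioning vector of shape (D,), e.g. a one-hot label.
    V_f, V_g:       projection matrices of shape (D, C) (hypothetical names).
    """
    if h is not None:
        # The same bias is added at every pixel: conditioning here is
        # location-independent and broadcasts over the H and W axes.
        conv_f = conv_f + h @ V_f
        conv_g = conv_g + h @ V_g
    return np.tanh(conv_f) * sigmoid(conv_g)


# Small usage example with random stand-ins for the convolution outputs.
rng = np.random.default_rng(0)
conv_f = rng.standard_normal((4, 4, 8))
conv_g = rng.standard_normal((4, 4, 8))
h = np.zeros(10)
h[3] = 1.0  # one-hot "class 3" out of 10 hypothetical classes
V_f = rng.standard_normal((10, 8))
V_g = rng.standard_normal((10, 8))

y_uncond = gated_activation(conv_f, conv_g)
y_cond = gated_activation(conv_f, conv_g, h, V_f, V_g)
```

With a one-hot h, the product h @ V_f simply selects one row of V_f, so each class contributes its own per-channel bias to both branches. The spatially varying conditioning mentioned above would replace that bias with a full (H, W, C) map.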

Experimental results

Research questions

  • RQ1: Can a gated, autoregressive CNN match PixelRNN performance while reducing training time?
  • RQ2: Does conditioning PixelCNN on class labels or embeddings produce diverse, high-quality samples across tasks?
  • RQ3: Can Conditional PixelCNN serve effectively as a decoder in autoencoders to learn different latent representations?
  • RQ4: How does conditioning influence log-likelihood and visual diversity on CIFAR-10 and ImageNet-scale datasets?

Key findings

Negative log-likelihood (NLL) in bits/dim on the test set, with training-set values in parentheses; lower is better.

CIFAR-10
Model                          NLL Test (Train)
Uniform Distribution [30]      8.00
Multivariate Gaussian [30]     4.70
NICE [4]                       4.48
Deep Diffusion [24]            4.20
DRAW [9]                       4.13
Deep GMMs [31, 29]             4.00
Conv DRAW [8]                  3.58 (3.57)
RIDE [26, 30]                  3.47
PixelCNN [30]                  3.14 (3.08)
PixelRNN [30]                  3.00 (2.93)
Gated PixelCNN                 3.03 (2.90)

ImageNet 32x32
Model                          NLL Test (Train)
Conv DRAW [8]                  4.40 (4.35)
PixelRNN [30]                  3.86 (3.83)
Gated PixelCNN                 3.83 (3.77)

ImageNet 64x64
Model                          NLL Test (Train)
Conv DRAW [8]                  4.10 (4.04)
PixelRNN [30]                  3.63 (3.57)
Gated PixelCNN                 3.57 (3.48)

  • Gated PixelCNN is close to PixelRNN on CIFAR-10 (3.03 vs. 3.00 test NLL) and competitive on ImageNet, while using less than half the training time.
  • Class-conditioned sampling yields clearly distinct and diverse samples across 8 classes, with variations in pose and background.
  • Sampling conditioned on portrait embeddings generates new faces of the same person with varied expressions, poses, and lighting; interpolation in embedding space yields smooth transitions.
  • PixelCNN auto-encoder reconstructions differ qualitatively from those of an auto-encoder trained with MSE, suggesting the encoder captures higher-level structure when paired with a probabilistic PixelCNN decoder.
  • On ImageNet, Gated PixelCNN achieves lower test NLL than PixelRNN at both 32x32 (3.83 vs. 3.86) and 64x64 (3.57 vs. 3.63) while remaining computationally cheaper.
  • The two-stack (horizontal and vertical) architecture removes the blind spot in receptive fields, enabling better modeling of pixel dependencies.
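
The blind spot in the last point can be illustrated by tracking which pixel offsets can influence the current pixel after several layers. The sketch below is a simplified offset-set model with 3x3 kernels, not the paper's implementation: the kernel offset sets, the one-row shift of the vertical stack, and all function names are assumptions chosen for illustration.

```python
from itertools import product


def minkowski(a, b):
    """All pairwise sums of offsets: the receptive field of two stacked layers."""
    return {(ay + by, ax + bx) for (ay, ax), (by, bx) in product(a, b)}


# Offsets (dy, dx) relative to the current pixel that a 3x3 kernel may read.
MASK_A = {(-1, -1), (-1, 0), (-1, 1), (0, -1)}  # first layer: current pixel hidden
MASK_B = MASK_A | {(0, 0)}                      # later layers: own position allowed
# Assumed vertical-stack kernel: current row and the row above, full width.
VERTICAL = {(dy, dx) for dy in (-1, 0) for dx in (-1, 0, 1)}


def single_stack_rf(n_layers):
    """Receptive field of n stacked masked convolutions (original PixelCNN style)."""
    rf = set(MASK_A)
    for _ in range(n_layers - 1):
        rf = minkowski(rf, MASK_B)
    return rf


def two_stack_rf(n_layers):
    """Receptive field of the horizontal stack when it is fed by a vertical stack."""
    v_rf = {(0, 0)}
    h_rf = {(0, 0)}
    for layer in range(n_layers):
        v_rf = minkowski(v_rf, VERTICAL)
        # Horizontal stack: pixels to the left (and itself after the first layer).
        left = {(0, -1)} if layer == 0 else {(0, -1), (0, 0)}
        # Vertical features are read from one row up, so only rows above leak in.
        from_above = {(dy - 1, dx) for dy, dx in v_rf}
        h_rf = minkowski(h_rf, left) | from_above
    return h_rf


def is_causal(rf):
    """True if every offset precedes the current pixel in raster-scan order."""
    return all(dy < 0 or (dy == 0 and dx < 0) for dy, dx in rf)


blind = (-1, 2)  # one row up, two columns right: precedes the current pixel
print(blind in single_stack_rf(6), blind in two_stack_rf(2))
```

In this model the single masked stack never reaches the offset (-1, 2) no matter how many layers are added, because each step up one row moves at most one column to the right. The two-stack variant covers it after two layers while still reading only pixels that precede the current one.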

This review was generated by AI and checked by a human editor.