QUICK REVIEW

[Paper Review] TAC-GAN - Text Conditioned Auxiliary Classifier Generative Adversarial Network

Ayushman Dash, John Cristian Borges Gamboa|arXiv (Cornell University)|Mar 19, 2017
Generative Adversarial Networks and Image Synthesis | References: 16 | Cited by: 111
One-Sentence Summary

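TAC-GAN generates images from text descriptions by conditioning on text embeddings and adding an auxiliary classifier to the discriminator, achieving better discriminability and diversity than existing text-to-image models.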

ABSTRACT

In this work, we present the Text Conditioned Auxiliary Classifier Generative Adversarial Network, (TAC-GAN) a text to image Generative Adversarial Network (GAN) for synthesizing images from their text descriptions. Former approaches have tried to condition the generative process on the textual data; but allying it to the usage of class information, known to diversify the generated samples and improve their structural coherence, has not been explored. We trained the presented TAC-GAN model on the Oxford-102 dataset of flowers, and evaluated the discriminability of the generated images with Inception-Score, as well as their diversity using the Multi-Scale Structural Similarity Index (MS-SSIM). Our approach outperforms the state-of-the-art models, i.e., its inception score is 3.45, corresponding to a relative increase of 7.8% compared to the recently introduced StackGan. A comparison of the mean MS-SSIM scores of the training and generated samples per class shows that our approach is able to generate highly diverse images with an average MS-SSIM of 0.14 over all generated classes.

Motivation and Objectives

  • Motivate generating diverse, discriminable images from textual descriptions.
  • Incorporate text embeddings into a GAN framework via an auxiliary classifier to improve structure and content coherence.
  • Evaluate synthesis quality and diversity using Inception Score and MS-SSIM on Oxford-102 flowers.
  • Demonstrate interpolation in text and style/content disentanglement to show controllable generation.

Proposed Method

  • Extend AC-GAN by conditioning the generator on text embeddings (Skip-Thought) instead of class labels.
  • Represent text via a text embedding Ψ(t) and learn a latent text representation l_g = L_G(Ψ(t)) that is concatenated with a noise vector z.
  • Construct a generator G that outputs 128×128×3 images from z_c = [l_g; z] through transposed convolutions.
  • Design a discriminator D that receives real, fake, and wrong-image triplets alongside their corresponding text embeddings and class labels, producing two outputs: D_S (real/fake) and D_C (class).
  • Train with the L_DS and L_DC losses for the discriminator and the L_GS and L_GC losses for the generator, encouraging realistic and correctly labeled outputs.
  • Optionally extend the framework to include additional information by adding a new discriminator output DL_Y and corresponding losses.
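
The steps above (building the generator input by concatenation, then scoring real, fake, and wrong images with a real/fake head and a class head) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the target convention (real scored as 1, fake and wrong scored as 0), the use of cross-entropy for every term, and all function names are assumptions made for the example.

```python
import math
import random

EPS = 1e-12  # numerical guard against log(0)

def bce(p, target):
    """Binary cross-entropy of one probability p against a 0/1 target."""
    return -(target * math.log(p + EPS) + (1 - target) * math.log(1 - p + EPS))

def cce(probs, label):
    """Categorical cross-entropy of one class-probability vector."""
    return -math.log(probs[label] + EPS)

def generator_input(l_g, noise_dim, rng):
    """Build z_c = [l_g; z]: latent text vector followed by Gaussian noise z."""
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return list(l_g) + z

def discriminator_losses(real, fake, wrong):
    """Each argument is (d_s, d_c, label): real/fake probability,
    class probabilities, and the target class for that image.
    Assumed targets: real -> 1, fake -> 0, wrong -> 0."""
    l_ds = bce(real[0], 1) + bce(fake[0], 0) + bce(wrong[0], 0)
    l_dc = sum(cce(d_c, label) for _, d_c, label in (real, fake, wrong))
    return l_ds, l_dc

def generator_losses(fake):
    """The generator wants its fake image judged real and correctly classified."""
    d_s, d_c, label = fake
    return bce(d_s, 1), cce(d_c, label)

# Hypothetical discriminator outputs for one (real, fake, wrong) triplet.
real = (0.9, [0.8, 0.2], 0)
fake = (0.1, [0.7, 0.3], 0)
wrong = (0.2, [0.4, 0.6], 1)

z_c = generator_input([0.1, 0.2, 0.3, 0.4], noise_dim=3, rng=random.Random(0))
l_ds, l_dc = discriminator_losses(real, fake, wrong)
l_gs, l_gc = generator_losses(fake)
```

In a real model the tuples would come from the discriminator's two output heads, and the discriminator and generator would be updated alternately on the sums of their respective terms.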

Experimental Results

Research Questions

  • RQ1: Can TAC-GAN generate images that are both discriminable and faithful to textual descriptions?
  • RQ2: Does conditioning on text embeddings with an auxiliary classifier improve image quality and diversity compared to prior text-to-image methods?
  • RQ3: How does TAC-GAN compare to StackGAN and other baselines in terms of inception score and diversity metrics?
  • RQ4: Is it possible to interpolate in text and style to produce coherent variations of generated images?
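
The interpolation asked about in the last question can be pictured as walking between two vectors and generating an image at each step. The sketch below assumes simple linear interpolation, which this review does not confirm as the paper's exact scheme; the function names are hypothetical, and the same helper would apply to either text embeddings (content) or noise vectors (style).

```python
def lerp(a, b, alpha):
    """Linear blend of two equal-length vectors: (1 - alpha) * a + alpha * b."""
    return [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]

def interpolation_path(start, end, steps):
    """Return `steps` evenly spaced vectors from start to end, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(start, end, i / (steps - 1)) for i in range(steps)]

# Toy 2-dimensional "embeddings"; real text embeddings are much larger.
path = interpolation_path([0.0, 2.0], [1.0, 0.0], steps=3)
```

Holding the noise vector fixed while interpolating the text embedding varies content under a constant style, and the reverse varies style under constant content.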

Key Findings

Model          Inception Score
TAC-GAN        3.45 ± 0.05
StackGAN       3.20 ± 0.01
GAN-INT-CLS    2.66 ± 0.03

  • Inception Score of TAC-GAN is 3.45±0.05, higher than StackGAN's 3.20±0.01 and GAN-INT-CLS's 2.66±0.03.
  • TAC-GAN produces diverse samples: the mean MS-SSIM over all generated classes is 0.13±0.016, close to the training-data mean of 0.14±0.019 (a lower MS-SSIM indicates higher diversity) and indicating higher diversity than some baselines.
  • The model demonstrates content/style disentanglement, evidenced by content-preserving interpolation over different noise vectors and text embeddings.
  • Mean MS-SSIM comparison shows the generated samples are more diverse than the training data in aggregate, supporting the diversity claim.
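
For readers unfamiliar with the metric, the Inception Score is the exponential of the average KL divergence between each image's predicted class distribution p(y|x) and the marginal p(y). The sketch below computes it from a list of probability vectors and also reproduces the 7.8% relative increase quoted in the abstract. It is an illustration only: in practice the probabilities come from a pretrained Inception network and the score is averaged over several splits, neither of which is modeled here, and the function names are hypothetical.

```python
import math

def inception_score(probs):
    """exp(mean over images of KL(p(y|x) || p(y))) for a list of
    per-image class-probability vectors."""
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    kl_total = 0.0
    for p in probs:
        # Terms with p_j == 0 contribute nothing to the KL divergence.
        kl_total += sum(pj * math.log(pj / mj)
                        for pj, mj in zip(p, marginal) if pj > 0)
    return math.exp(kl_total / n)

def relative_increase(new, old):
    """Fractional improvement of `new` over `old`."""
    return (new - old) / old

# Confident and varied predictions give the maximum score (number of classes).
sharp = inception_score([[1.0, 0.0], [0.0, 1.0]])
# Identical, uninformative predictions give the minimum score of 1.
flat = inception_score([[0.5, 0.5], [0.5, 0.5]])
# TAC-GAN (3.45) versus StackGAN (3.20), as reported in the table.
gain_percent = round(relative_increase(3.45, 3.20) * 100, 1)
```

A higher score therefore rewards images that are individually easy to classify while collectively covering many classes.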

This review was generated by AI and reviewed by a human editor.