QUICK REVIEW

[Paper Review] TAC-GAN - Text Conditioned Auxiliary Classifier Generative Adversarial Network

Ayushman Dash, John Cristian Borges Gamboa|arXiv (Cornell University)|Mar 19, 2017

Generative Adversarial Networks and Image Synthesis|References: 16|Cited by: 111

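One-Sentence Summary

TAC-GAN generates images from text descriptions by conditioning on text embeddings and adding an auxiliary classifier to the discriminator, achieving higher discriminability and diversity than existing text-to-image models.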

ABSTRACT

In this work, we present the Text Conditioned Auxiliary Classifier Generative Adversarial Network, (TAC-GAN) a text to image Generative Adversarial Network (GAN) for synthesizing images from their text descriptions. Former approaches have tried to condition the generative process on the textual data; but allying it to the usage of class information, known to diversify the generated samples and improve their structural coherence, has not been explored. We trained the presented TAC-GAN model on the Oxford-102 dataset of flowers, and evaluated the discriminability of the generated images with Inception-Score, as well as their diversity using the Multi-Scale Structural Similarity Index (MS-SSIM). Our approach outperforms the state-of-the-art models, i.e., its inception score is 3.45, corresponding to a relative increase of 7.8% compared to the recently introduced StackGan. A comparison of the mean MS-SSIM scores of the training and generated samples per class shows that our approach is able to generate highly diverse images with an average MS-SSIM of 0.14 over all generated classes.

Motivation and Objectives

Generate diverse, discriminable images from textual descriptions.
Combine text conditioning with class information, via an auxiliary classifier, in a single GAN framework to improve the diversity and structural coherence of generated images.
Evaluate synthesis quality and diversity using Inception Score and MS-SSIM on Oxford-102 flowers.
Demonstrate interpolation in text and style/content disentanglement to show controllable generation.

Proposed Method

Extend AC-GAN by conditioning the generator on text embeddings (Skip-Thought) instead of class labels.
Represent text via a text embedding Ψ(t) and learn a latent text representation l_g = L_G(Ψ(t)) that is concatenated with a noise vector z.
Construct a generator G that outputs 128x128x3 images from z_c = [l_g; z] through transposed convolutions.
Design a discriminator D that receives real, fake, and wrong images (real images paired with a mismatched description) together with their text embeddings and class labels, and produces two outputs: D_S (real/fake) and D_C (class).
Train with the L_DS and L_DC losses for the discriminator and the L_GS and L_GC losses for the generator, encouraging realistic and correctly labeled outputs.
Optionally extend the framework to include additional information by adding a new discriminator output DL_Y and corresponding losses.
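
The loss structure described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the usual AC-GAN pattern, in which the source losses are binary cross-entropies and the class losses are categorical cross-entropies, and it assumes the fake image is labeled with the class of the description it was generated from. The function names (conditioned_latent, bce, ce, tac_gan_losses) and all the numbers in the example are hypothetical.

```python
import math

def conditioned_latent(l_g, z):
    """Build z_c = [l_g; z] by concatenating the latent text vector and the noise vector."""
    return list(l_g) + list(z)

def bce(p, target):
    """Binary cross-entropy for one probability p and a 0/1 target."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))

def ce(class_probs, label):
    """Cross-entropy between predicted class probabilities and an integer label."""
    return -math.log(max(class_probs[label], 1e-12))

def tac_gan_losses(d_out, labels):
    """Compute the four losses from discriminator outputs.

    d_out maps 'real', 'fake', 'wrong' to a pair (D_S probability, D_C class probabilities).
    labels maps the same keys to integer class labels (assumed convention).
    """
    ds_real, dc_real = d_out["real"]
    ds_fake, dc_fake = d_out["fake"]
    ds_wrong, dc_wrong = d_out["wrong"]
    # Discriminator: real should score 1, fake and wrong should score 0.
    l_ds = bce(ds_real, 1) + bce(ds_fake, 0) + bce(ds_wrong, 0)
    # Discriminator: every image should be assigned its class label.
    l_dc = (ce(dc_real, labels["real"]) + ce(dc_fake, labels["fake"])
            + ce(dc_wrong, labels["wrong"]))
    # Generator: fake images should be judged real and correctly classified.
    l_gs = bce(ds_fake, 1)
    l_gc = ce(dc_fake, labels["fake"])
    return {"L_DS": l_ds, "L_DC": l_dc, "L_GS": l_gs, "L_GC": l_gc}

# Hypothetical toy values: a 2-class problem and tiny latent vectors.
z_c = conditioned_latent([0.1, -0.3], [0.5, 0.2, -0.7])
example_outputs = {
    "real": (0.9, [0.8, 0.2]),
    "fake": (0.5, [0.5, 0.5]),
    "wrong": (0.1, [0.3, 0.7]),
}
example_labels = {"real": 0, "fake": 0, "wrong": 1}
losses = tac_gan_losses(example_outputs, example_labels)
```

In a real training loop the discriminator would minimize L_DS + L_DC and the generator L_GS + L_GC, with the probabilities coming from the networks rather than from hand-written values.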

Experimental Results

Research Questions

RQ1: Can TAC-GAN generate images that are both discriminable and faithful to textual descriptions?
RQ2: Does conditioning on text embeddings with an auxiliary classifier improve image quality and diversity compared to prior text-to-image methods?
RQ3: How does TAC-GAN compare to StackGAN and other baselines in terms of Inception Score and diversity metrics?
RQ4: Is it possible to interpolate in text and style to produce coherent variations of generated images?

Key Findings

Model	Inception Score
TAC-GAN	3.45±0.05
StackGan	3.20±0.01
GAN-INT-CLS	2.66±0.03
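
The relative improvement quoted in the abstract can be checked directly from the mean scores in the table. The short sketch below does that arithmetic; the helper name relative_increase is a hypothetical choice, and the standard deviations are ignored.

```python
def relative_increase(new_score, baseline_score):
    """Percentage by which new_score exceeds baseline_score."""
    return (new_score - baseline_score) / baseline_score * 100.0

# Mean Inception Scores taken from the table above.
scores = {"TAC-GAN": 3.45, "StackGAN": 3.20, "GAN-INT-CLS": 2.66}

gain_vs_stackgan = relative_increase(scores["TAC-GAN"], scores["StackGAN"])
gain_vs_gan_int_cls = relative_increase(scores["TAC-GAN"], scores["GAN-INT-CLS"])

print(f"vs StackGAN: {gain_vs_stackgan:.1f}%")  # vs StackGAN: 7.8%
print(f"vs GAN-INT-CLS: {gain_vs_gan_int_cls:.1f}%")
```

The first figure matches the 7.8% relative increase over StackGAN stated in the abstract.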

TAC-GAN reaches an Inception Score of 3.45±0.05, higher than StackGAN's 3.20±0.01 (a relative increase of 7.8%) and GAN-INT-CLS's 2.66±0.03.
TAC-GAN produces diverse samples: the mean MS-SSIM over all generated classes is 0.14±0.019, close to the training-data mean of 0.13±0.016 (lower MS-SSIM indicates higher diversity).
The model demonstrates content/style disentanglement, evidenced by content-preserving interpolation over different noise vectors and text embeddings.
The per-class comparison of mean MS-SSIM between training and generated samples shows that the generated images are nearly as diverse as the training data, supporting the diversity claim.
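
For readers unfamiliar with the metric behind the scores above, the Inception Score is commonly defined as the exponential of the mean KL divergence between each image's predicted class distribution p(y|x) and the marginal p(y) over all images. The sketch below implements that definition in plain Python. It is an illustration only: in practice the probabilities come from a pretrained Inception network, whereas the toy inputs here are hypothetical, and the function name inception_score is an assumption.

```python
import math

def inception_score(class_probs):
    """exp of the mean KL divergence between each p(y|x) and the marginal p(y)."""
    n = len(class_probs)
    k = len(class_probs[0])
    # Marginal class distribution p(y), averaged over all samples.
    marginal = [sum(p[j] for p in class_probs) / n for j in range(k)]
    total_kl = 0.0
    for p in class_probs:
        # Terms with p[j] == 0 contribute nothing (0 * log 0 is taken as 0).
        total_kl += sum(p[j] * math.log(p[j] / marginal[j])
                        for j in range(k) if p[j] > 0)
    return math.exp(total_kl / n)

# Hypothetical predictions: confident and spread over both classes.
confident_and_varied = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical predictions: every image looks equally like both classes.
uninformative = [[0.5, 0.5], [0.5, 0.5]]

best_case = inception_score(confident_and_varied)
worst_case = inception_score(uninformative)
```

With confident predictions spread evenly over K classes the score approaches K, and with uninformative predictions it drops to 1, which is why a higher score is read as more discriminable and more varied output.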


This review was generated by AI and checked by a human editor.