QUICK REVIEW

[Paper Review] AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks

Tao Xu, Pengchuan Zhang|arXiv (Cornell University)|Nov 28, 2017

Generative Adversarial Networks and Image Synthesis|29 references|158 citations

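ONE-SENTENCE SUMMARY

AttnGAN introduces an attention-driven, multi-stage GAN that generates fine-grained images from text and uses a Deep Attentional Multimodal Similarity Model (DAMSM) for fine-grained image-text matching, achieving state-of-the-art results on the CUB and COCO datasets.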

ABSTRACT

In this paper, we propose an Attentional Generative Adversarial Network (AttnGAN) that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation. With a novel attentional generative network, the AttnGAN can synthesize fine-grained details at different subregions of the image by paying attentions to the relevant words in the natural language description. In addition, a deep attentional multimodal similarity model is proposed to compute a fine-grained image-text matching loss for training the generator. The proposed AttnGAN significantly outperforms the previous state of the art, boosting the best reported inception score by 14.14% on the CUB dataset and 170.25% on the more challenging COCO dataset. A detailed analysis is also performed by visualizing the attention layers of the AttnGAN. It for the first time shows that the layered attentional GAN is able to automatically select the condition at the word level for generating different parts of the image.
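
The relative gains quoted in the abstract can be checked with simple arithmetic. The sketch below is only a sanity check: the AttnGAN scores (4.36 on CUB, 25.89 on COCO) are the ones listed under the findings of this review, while the previous-best scores (3.82 and 9.58) are assumptions recalled from the paper's comparison table and are not stated anywhere in this review. The helper name `relative_gain` is hypothetical.

```python
def relative_gain(previous_best: float, new_score: float) -> float:
    """Return the improvement of new_score over previous_best, in percent."""
    return (new_score - previous_best) / previous_best * 100.0


# AttnGAN scores as listed in this review. The baseline scores are
# assumptions (recalled from the paper's comparison table), not values
# that appear in this review.
cub_prev, cub_attngan = 3.82, 4.36
coco_prev, coco_attngan = 9.58, 25.89

cub_gain = relative_gain(cub_prev, cub_attngan)
coco_gain = relative_gain(coco_prev, coco_attngan)

print(f"CUB:  {cub_gain:.2f}%")
print(f"COCO: {coco_gain:.2f}%")
```

Under these assumed baselines, the two results round to the 14.14% and 170.25% figures quoted in the abstract.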

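MOTIVATION AND OBJECTIVES

Enable fine-grained image synthesis from natural language descriptions.
Develop an attentional GAN that refines images over multiple stages.
Introduce a Deep Attentional Multimodal Similarity Model (DAMSM) for fine-grained image-text matching.
Evaluate AttnGAN against previous state-of-the-art text-to-image models on standard datasets.
Analyze attention visualizations to understand word-level conditioning during generation.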

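PROPOSED METHOD

Propose an attentional generative network with multiple generators that use word-level attention to produce progressively higher-resolution images, conditioning each subregion on the relevant words.
Incorporate an attention mechanism in which each image subregion queries the relevant word vectors, forming a multimodal context for generation.
Couple the generators with both adversarial and DAMSM losses; the adversarial loss has an unconditional component and a conditional (text-matching) component.
Use DAMSM to compute a fine-grained image-text matching loss that aligns image subregions with their corresponding words.
Encode the text with a bidirectional LSTM to obtain word vectors and a global sentence vector; map image subregions into a common semantic space with a CNN (an Inception-v3-based encoder).
Train the model by balancing the GAN loss against the DAMSM loss, which encourages word-level alignment and reduces mode collapse.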
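
The attention step described above, in which each image subregion queries the word vectors, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the word vectors have already been projected to the same dimension as the region features, it uses plain dot-product scores with a softmax over words, and the function names (`softmax`, `dot`, `word_context`) and the 2-D toy features are all hypothetical. A real model would do this with batched tensor operations.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def word_context(regions, words):
    """For each region vector, attend over all word vectors.

    regions: list of N region feature vectors, each of length D
    words:   list of T word vectors, assumed already projected to length D
    Returns (contexts, weights): N context vectors of length D and
    N attention distributions over the T words.
    """
    contexts, weights = [], []
    for region in regions:
        # Query: one image subregion. Keys and values: the word vectors.
        attn = softmax([dot(region, word) for word in words])
        # Context vector: attention-weighted sum of the word vectors.
        context = [
            sum(w * word[d] for w, word in zip(attn, words))
            for d in range(len(region))
        ]
        contexts.append(context)
        weights.append(attn)
    return contexts, weights


# Toy example with hypothetical 2-D features: two regions, three words.
words = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
regions = [[4.0, 0.0], [0.0, 4.0]]
contexts, weights = word_context(regions, words)

for j, row in enumerate(weights):
    print(f"region {j} attends most to word {row.index(max(row))}")
```

In this toy setup each region puts most of its weight on the word whose vector points in the same direction, which is the behavior the word-level conditioning relies on: different parts of the image draw on different words.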

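EXPERIMENTAL RESULTS

Research Questions

RQ1: Does attention-driven, multi-stage refinement produce higher-quality fine-grained images than conditioning on a global sentence vector alone?
RQ2: Does the Deep Attentional Multimodal Similarity Model improve training by providing a fine-grained image-text matching loss?
RQ3: How does AttnGAN compare with previous GAN models on the fine-grained CUB dataset and the more complex, multi-object COCO dataset?
RQ4: What insights can be gained from visualizing the attention maps during generation?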

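Key Findings

AttnGAN substantially improves the Inception score on both CUB and COCO, reaching 4.36 on CUB and 25.89 on COCO in one configuration.
Layered attention enables word-level conditioning of subregions, which improves the fine-grained detail of generated images.
Stacking attention stages (AttnGAN2) yields higher-resolution outputs (up to 256x256) and better scores than configurations with fewer stages.
DAMSM significantly improves both R-precision (image-text matching) and the Inception score; higher lambda values (the weight on the DAMSM loss) generally improve both metrics.
Qualitative analysis shows that attention focuses on words semantically related to each subregion, and that the generated image changes meaningfully when the attended words change.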
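
This review does not define R-precision, so the sketch below uses the standard retrieval definition: rank the candidate descriptions for a generated image by similarity and report the fraction of the top R that are true matches. The similarity scores here are made up for illustration; in practice they would presumably come from an image-text similarity model such as DAMSM. The function names and the four-candidate setup are hypothetical, not taken from the paper.

```python
def r_precision(similarities, relevant, r=1):
    """Fraction of the top-r ranked candidates that are relevant.

    similarities: list of similarity scores, one per candidate description
    relevant:     set of indices of the ground-truth (matching) descriptions
    """
    ranked = sorted(
        range(len(similarities)),
        key=lambda i: similarities[i],
        reverse=True,
    )
    hits = sum(1 for i in ranked[:r] if i in relevant)
    return hits / r


def mean_r_precision(queries, r=1):
    """Average r_precision over a list of (similarities, relevant) pairs."""
    scores = [r_precision(sims, rel, r) for sims, rel in queries]
    return sum(scores) / len(scores)


# Hypothetical similarity scores for two generated images, each compared
# against four candidate descriptions; index 0 is the true description.
queries = [
    ([0.9, 0.2, 0.4, 0.1], {0}),  # true description ranked first
    ([0.3, 0.7, 0.1, 0.2], {0}),  # a mismatching description ranked first
]
score = mean_r_precision(queries, r=1)
print(f"mean R-precision: {score:.2f}")
```

A higher value means the generated images are more often closest to their own descriptions, which is why this metric is read as a measure of image-text consistency alongside the Inception score.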


This review was generated by AI and reviewed by a human editor.