QUICK REVIEW

[Paper Review] Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space

Liwei Wang, Alexander G. Schwing|arXiv (Cornell University)|Nov 19, 2017

Multimodal Machine Learning Applications|References: 33|Cited by: 65

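One-Sentence Summary

This paper proposes two CVAE-based models (GMM-CVAE and AG-CVAE) that structure the latent space around multiple Gaussian components to generate more diverse and more accurate image captions. Both outperform vanilla CVAE and LSTM baselines on MSCOCO, with AG-CVAE offering greater diversity and controllability.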

ABSTRACT

This paper explores image caption generation using conditional variational auto-encoders (CVAEs). Standard CVAEs with a fixed Gaussian prior yield descriptions with too little variability. Instead, we propose two models that explicitly structure the latent space around $K$ components corresponding to different types of image content, and combine components to create priors for images that contain multiple types of content simultaneously (e.g., several kinds of objects). Our first model uses a Gaussian Mixture model (GMM) prior, while the second one defines a novel Additive Gaussian (AG) prior that linearly combines component means. We show that both models produce captions that are more diverse and more accurate than a strong LSTM baseline or a "vanilla" CVAE with a fixed Gaussian prior, with AG-CVAE showing particular promise.

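Research Motivation and Goals

Move image captioning beyond a fixed Gaussian prior so that generated captions are both diverse and accurate.
Structure the latent space as multiple Gaussian components, each corresponding to a type of image content.
Propose two priors: a Gaussian mixture model (GMM) prior and an additive Gaussian (AG) prior.
Demonstrate gains in diversity and accuracy over the baselines, and enable controllable caption generation.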

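Proposed Method

Extend the CVAE framework to image captioning by conditioning on an image content vector c(I).
Introduce GMM-CVAE: the prior p(z|c) is a Gaussian mixture with weights c_k and components (μ_k, σ_k).
Introduce AG-CVAE: the prior p(z|c) is a single Gaussian whose mean is a linear combination of the component means weighted by c_k, giving p(z|c) = N(z | sum_k c_k μ_k, σ^2 I).
Derive tractable KL terms for both priors in order to train the encoder q_phi(z|x,c).
Use ground-truth object annotations during training; at test time, obtain c(I) from an object detector.
Use an LSTM-based encoder/decoder architecture; sample z from the prior conditioned on image content; backpropagate through the sampling step with the reparameterization trick.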
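
The AG prior and the sampling step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`ag_prior_mean`, `sample_z`, `kl_diag_gaussians`), the array shapes, and the toy values are assumptions made for illustration, and the weights c_k are assumed to sum to 1. The KL function is the standard closed form between two diagonal Gaussians, which applies to the AG prior because that prior is a single Gaussian; the KL term for the GMM prior is not covered here.

```python
import numpy as np

def ag_prior_mean(c, mu):
    """Mean of the additive Gaussian prior: sum_k c_k * mu_k.

    c:  (K,) content weights (assumed to sum to 1)
    mu: (K, D) component means
    """
    return c @ mu

def sample_z(mean, std, rng):
    """Reparameterized sample: z = mean + std * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mean.shape)
    return mean + std * eps

def kl_diag_gaussians(mu_q, std_q, mu_p, std_p):
    """Closed-form KL( N(mu_q, diag(std_q^2)) || N(mu_p, diag(std_p^2)) )."""
    var_q, var_p = std_q ** 2, std_p ** 2
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# Toy example (hypothetical values): K = 3 components, D = 2 latent dimensions.
c = np.array([0.5, 0.5, 0.0])           # two content types present, equal weight
mu = np.array([[2.0, 0.0], [0.0, 2.0], [4.0, 4.0]])
prior_mean = ag_prior_mean(c, mu)       # midpoint of the two active component means

rng = np.random.default_rng(0)
z = sample_z(prior_mean, 0.1, rng)      # one latent sample from the AG prior
kl = kl_diag_gaussians(np.zeros(2), np.ones(2), prior_mean, 1.0)
```

Because the prior mean moves linearly with c, changing which entries of c are nonzero shifts where z is sampled, which is the mechanism behind the content-based control attributed to AG-CVAE.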

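Experimental Results

Research Questions

RQ1: Can structuring the latent space into multiple Gaussian components improve caption diversity without sacrificing accuracy?
RQ2: Do GMM-CVAE and AG-CVAE produce more diverse and more controllable captions on MSCOCO than the vanilla CVAE and the LSTM?
RQ3: How does the choice of prior (GMM vs. additive Gaussian) affect diversity, controllability, and re-ranking performance?
RQ4: Is AG-CVAE more effective at capturing object co-occurrence and at enabling content-based caption control?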

Key Findings

Tables in the paper: Table 1, oracle performance; Table 2, consensus re-ranking performance (CIDEr-based); Table 3, diversity evaluation (unique and novel sentences). Table 1 follows.
Columns: obj = object vector used; #z = number of z samples; std = sampling standard deviation; beam = beam width; B1 to B4 = BLEU-1 to BLEU-4; C = CIDEr; R = ROUGE; M = METEOR; S = SPICE.

Method	obj	#z	std	beam	B4	B3	B2	B1	C	R	M	S
LSTM	-	-	-	10	0.413	0.515	0.643	0.790	1.157	0.597	0.285	0.218
LSTM	✓	-	-	10	0.428	0.529	0.654	0.797	1.202	0.607	0.290	0.223
CVAE	-	20	0.1	-	0.261	0.381	0.538	0.742	0.860	0.531	0.246	0.184
CVAE	✓	20	2	-	0.312	0.421	0.565	0.733	0.910	0.541	0.244	0.176
GMM-CVAE	-	20	0.1	-	0.371	0.481	0.619	0.778	1.080	0.582	0.274	0.209
GMM-CVAE	✓	20	2	-	0.423	0.533	0.666	0.813	1.216	0.617	0.298	0.233
GMM-CVAE	✓	100	2	-	0.494	0.597	0.719	0.856	1.378	0.659	0.325	0.261
GMM-CVAE	✓	100	2	2	0.527	0.625	0.740	0.865	1.430	0.670	0.329	0.277
AG-CVAE	-	20	0.1	-	0.431	0.537	0.668	0.814	1.230	0.622	0.300	0.235
AG-CVAE	✓	20	2	-	0.451	0.557	0.686	0.829	1.259	0.630	0.305	0.243
AG-CVAE	✓	100	2	-	0.532	0.631	0.749	0.876	1.478	0.682	0.342	0.278
AG-CVAE	✓	100	2	2	0.557	0.654	0.767	0.883	1.517	0.690	0.345	0.277
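
To make the oracle (upper-bound) scores in the table concrete, here is a minimal sketch under the assumption that the oracle score for an image is the best score achieved by any of its candidate captions. The metric `unigram_precision` is a toy stand-in, not BLEU, CIDEr, or any metric used in the paper, and both function names are hypothetical.

```python
def unigram_precision(candidate, reference):
    """Toy stand-in metric: fraction of candidate tokens that appear in the reference."""
    cand_tokens = candidate.lower().split()
    ref_tokens = set(reference.lower().split())
    if not cand_tokens:
        return 0.0
    return sum(tok in ref_tokens for tok in cand_tokens) / len(cand_tokens)

def oracle_score(candidates, reference, metric=unigram_precision):
    """Best score achieved by any candidate: an upper bound on what re-ranking could select."""
    return max(metric(cand, reference) for cand in candidates)

# Hypothetical candidates, e.g. captions decoded from different z samples.
reference = "a cat sits on the mat"
few = ["a dog runs"]
more = few + ["a cat sits on a mat"]

score_few = oracle_score(few, reference)
score_more = oracle_score(more, reference)
```

Since the oracle is a maximum, adding candidates to a pool can only keep or raise it, which is consistent with the 100-sample rows scoring above the 20-sample rows in the table.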

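In the upper-bound oracle evaluation on standard captioning metrics, GMM-CVAE and AG-CVAE outperform the vanilla CVAE in every matched configuration, and with 100 z samples both clearly outperform the LSTM baseline (e.g., CIDEr 1.378 and 1.478 vs. 1.202).
AG-CVAE generally surpasses GMM-CVAE in diversity and controllability, producing more unique captions per image and responding more faithfully to the content vector.
Under consensus re-ranking, GMM-CVAE and AG-CVAE beat the baselines on CIDEr, with AG-CVAE scoring slightly higher.
Compared with LSTM beam search, the CVAE variants produce more diverse captions when multiple z samples are used (Table 3).
By modifying the content vector c(I), AG-CVAE enables intuitive, interpretable control over captions.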
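
The diversity comparison refers to unique and novel sentences. A minimal sketch of how such statistics could be computed is shown below; the exact definitions used in the paper may differ. Here "unique" is assumed to mean distinct among the captions sampled for one image, "novel" is assumed to mean absent from the training captions, and the function name `diversity_stats` and the whitespace and case normalization are illustrative choices.

```python
def normalize(caption):
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return " ".join(caption.lower().split())

def diversity_stats(samples, training_captions):
    """Fraction of distinct sampled captions, and fraction of those not seen in training."""
    if not samples:
        raise ValueError("samples must contain at least one caption")
    unique = {normalize(s) for s in samples}
    seen_in_training = {normalize(t) for t in training_captions}
    novel = unique - seen_in_training
    return {
        "unique_frac": len(unique) / len(samples),
        "novel_frac": len(novel) / len(unique),
    }

# Hypothetical captions sampled for one image with different z values.
samples = ["A dog runs", "a dog  runs", "a cat sits", "a bird flies"]
training_captions = ["a cat sits"]
stats = diversity_stats(samples, training_captions)
```

In this toy example, three of the four samples are distinct and two of those three do not appear in the training captions.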

This review was generated by AI and reviewed by a human editor.