QUICK REVIEW

[Paper Review] Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering

Vahid Kazemi, Ali Elqursh|arXiv (Cornell University)|Apr 11, 2017

Multimodal Machine Learning Applications | References: 2 | Cited by: 148

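One-Sentence Summary

This paper proposes a simple yet strong baseline for visual question answering that uses an LSTM question encoder, a ResNet image encoder, soft attention over image regions, and a two-layer classifier, surpassing the previous state of the art on both VQA 1.0 and VQA 2.0.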

ABSTRACT

This paper presents a new baseline for visual question answering task. Given an image and a question in natural language, our model produces accurate answers according to the content of the image. Our model, while being architecturally simple and relatively small in terms of trainable parameters, sets a new state of the art on both unbalanced and balanced VQA benchmark. On VQA 1.0 open ended challenge, our model achieves 64.6% accuracy on the test-standard set without using additional data, an improvement of 0.4% over state of the art, and on newly released VQA 2.0, our model scores 59.7% on validation set outperforming best previously reported results by 0.5%. The results presented in this paper are especially interesting because very similar models have been tried before but significantly lower performance were reported. In light of the new results we hope to see more meaningful research on visual question answering in the future.

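Motivation and Objectives

Motivate and establish a strong, simple VQA baseline that challenges the view that state-of-the-art performance requires more complex architectures.
Show that careful handling of training details (normalization, dropout, soft attention) yields significant gains even for a compact model.
Quantify performance on VQA 1.0 (test-standard) and VQA 2.0 (validation) and compare it against the previous state of the art.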

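Proposed Method

Encode the question with an LSTM fed by word embeddings.
Extract image features with a pretrained 152-layer ResNet, taking the output of the last convolutional layer (14x14x2048) and applying L2 normalization.
Apply stacked soft attention over the spatial image features, conditioned on the LSTM state, to obtain multiple image glimpses.
Concatenate the image glimpses with the final LSTM state and pass them through a two-layer classifier that outputs probabilities over the most frequent answers (top 3,000).
Train with a cross-entropy loss averaged over all correct answers for each question, using the Adam optimizer and dropout for regularization.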
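
The steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation. The function names are hypothetical, toy 2-dimensional vectors stand in for the 14x14x2048 feature map, and the dot-product scoring in the usage lines is an assumed stand-in, since this summary does not say how the attention scores are computed.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale one region's feature vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(regions, scores):
    """Soft attention: return one glimpse (weighted average of regions) and the weights."""
    weights = softmax(scores)
    dim = len(regions[0])
    glimpse = [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]
    return glimpse, weights

def averaged_cross_entropy(probs, correct_answers):
    """Mean of -log p(answer) over all annotated correct answers for one question."""
    return sum(-math.log(probs[a]) for a in correct_answers) / len(correct_answers)

# Toy usage: three image regions and a question state (all values are made up).
regions = [l2_normalize(r) for r in [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]]
question_state = [1.0, 0.0]
# Assumed scoring rule: dot product between each region and the question state.
scores = [sum(a * b for a, b in zip(r, question_state)) for r in regions]
glimpse, weights = attend(regions, scores)

# Classifier output over three candidate answers; annotators chose 0, 0, and 1.
loss = averaged_cross_entropy([0.5, 0.25, 0.25], [0, 0, 1])
```

In a fuller version, each glimpse would be concatenated with the final LSTM state before the two-layer classifier, and several glimpses would be produced by computing several sets of scores over the same regions.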

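Experimental Results

Research Questions

RQ1: Can a relatively simple architecture combined with careful training details achieve state-of-the-art results on VQA 1.0 and VQA 2.0?
RQ2: How do normalization, dropout, the attention mechanism, and architectural choices affect VQA performance?
RQ3: How does the proposed baseline compare with existing methods on standard VQA benchmarks?
RQ4: Is soft attention essential to improving the performance of a VQA model?
RQ5: How do hyperparameters (embedding dimension, LSTM size, attention size, classifier size) affect accuracy?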

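Key Findings

The model reaches 64.6% accuracy on VQA 1.0 test-standard without using additional data, 0.4% higher than the previous best method.
It scores 59.7% on the VQA 2.0 validation set, 0.5% above the best previously reported result.
L2 normalization of image features, dropout, and soft attention significantly improve accuracy and training efficiency.
Stacked attention yields only limited gains over single attention on a strong baseline; the two-layer classifier helps performance significantly.
The model uses ResNet-based image embeddings and a 1024-dimensional LSTM with 300-dimensional word embeddings; several hyperparameters have limited impact on performance within a reasonable range.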

This review was generated by AI and reviewed by human editors.