QUICK REVIEW

[Paper Review] Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering

Huijuan Xu, Kate Saenko|arXiv (Cornell University)|Nov 17, 2015

Multimodal Machine Learning Applications | References: 30 | Cited by: 105

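One-Sentence Summary

The paper proposes the Spatial Memory Network for visual question answering (SMem-VQA), a multi-hop memory network with spatial attention that models spatial reasoning explicitly by letting the question guide attention over image regions, which improves VQA performance. The model reaches state-of-the-art results on the VQA and DAQUAR datasets, improves on the iBOWIMG baseline by 2.35 percentage points on the VQA test-standard split, and supports interpretable visualization of its attention-based reasoning steps.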

ABSTRACT

We address the problem of Visual Question Answering (VQA), which requires joint image and language understanding to answer a question about a given photograph. Recent approaches have applied deep image captioning methods based on convolutional-recurrent networks to this problem, but have failed to model spatial inference. To remedy this, we propose a model we call the Spatial Memory Network and apply it to the VQA task. Memory networks are recurrent neural networks with an explicit attention mechanism that selects certain parts of the information stored in memory. Our Spatial Memory Network stores neuron activations from different spatial regions of the image in its memory, and uses the question to choose relevant regions for computing the answer, a process of which constitutes a single "hop" in the network. We propose a novel spatial attention architecture that aligns words with image patches in the first hop, and obtain improved results by adding a second attention hop which considers the whole question to choose visual evidence based on the results of the first hop. To better understand the inference process learned by the network, we design synthetic questions that specifically require spatial inference and visualize the attention weights. We evaluate our model on two published visual question answering datasets, DAQUAR [1] and VQA [2], and obtain improved results compared to a strong deep baseline model (iBOWIMG) which concatenates image and question features to predict the answer [3].

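Motivation and Goals

Address the lack of explicit spatial reasoning in existing VQA models that rely on global image features and recurrent networks.
Enable VQA models to perform multi-step spatial reasoning by modeling object locations and relations with a memory network architecture.
Design a question-guided spatial attention mechanism that aligns individual question words with specific image regions for fine-grained evidence gathering.
Evaluate the model's inference process with synthetic questions that require spatial reasoning, and interpret its behavior by visualizing the attention weights.
Improve on strong baselines, including iBOWIMG and DPPnet, on the standard VQA and DAQUAR benchmarks.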

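Proposed Method

The model is a memory network that stores feature activations from different spatial regions of the image as memory vectors, which makes spatial attention over visual features possible.
In the first hop, word-level attention is applied by computing a correlation score between each word embedding and each image-patch feature, giving a fine-grained alignment between question words and image regions.
In the second hop, the model combines the full question embedding with the attended features from the first hop to compute a refined attention map that selects more accurate visual evidence for answer prediction.
The network is trained end to end with a cross-entropy loss on the predicted answer; the attention weights are learned through backpropagation.
A third hop was explored but did not improve performance, which suggests diminishing returns beyond two hops.
The model is evaluated on the VQA and DAQUAR datasets, and its attention weights are visualized to interpret the spatial reasoning process.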
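The two hops described above can be sketched as a forward pass in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the dimensions, the random weights, the parameter names (`W_att`, `W_ev1`, `W_ev2`, `W_out`), the max over words used to pool word-level scores, and the bag-of-words sum used as the question embedding are all hypothetical choices made for the example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def two_hop_forward(regions, words, p):
    """regions: (L, M) features of L image regions; words: (T, N) word embeddings."""
    keys = regions @ p["W_att"]    # (L, N) attention embedding of each region
    vals1 = regions @ p["W_ev1"]   # (L, N) evidence embedding used in hop 1
    vals2 = regions @ p["W_ev2"]   # (L, N) evidence embedding used in hop 2

    # Hop 1: word-level attention. Score every (word, region) pair, then
    # keep the strongest word response per region (assumed pooling choice).
    corr = words @ keys.T                 # (T, L)
    att1 = softmax(corr.max(axis=0))      # (L,)
    question = words.sum(axis=0)          # (N,) bag-of-words question embedding
    hop1 = att1 @ vals1 + question        # (N,) attended evidence + question

    # Hop 2: the hop-1 output, which carries the whole question, re-scores regions.
    att2 = softmax(vals1 @ hop1)          # (L,)
    hop2 = att2 @ vals2 + hop1            # (N,)

    probs = softmax(p["W_out"] @ hop2)    # (A,) distribution over candidate answers
    return att1, att2, probs

def cross_entropy(probs, answer_idx):
    # Training loss for a single example with a known answer index.
    return -np.log(probs[answer_idx])

# Toy sizes (assumptions): 49 regions, 64-d region features, 5 words,
# 32-d embeddings, 10 candidate answers.
rng = np.random.default_rng(0)
L, M, T, N, A = 49, 64, 5, 32, 10
params = {
    "W_att": rng.normal(scale=0.1, size=(M, N)),
    "W_ev1": rng.normal(scale=0.1, size=(M, N)),
    "W_ev2": rng.normal(scale=0.1, size=(M, N)),
    "W_out": rng.normal(scale=0.1, size=(A, N)),
}
regions = rng.normal(size=(L, M))
words = rng.normal(size=(T, N))
att1, att2, probs = two_hop_forward(regions, words, params)
loss = cross_entropy(probs, answer_idx=3)
```

Reshaping `att1` and `att2` to the spatial grid (7 x 7 in this toy setup) gives the kind of attention maps that the visualizations inspect. In a trained model the weights would come from backpropagating `loss`, which this sketch does not do.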

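Experimental Results

Research Questions

RQ1: Can a memory network with spatial attention learn to perform multi-hop reasoning over image regions to answer visual questions?
RQ2: Does question-guided spatial attention improve VQA performance compared with models based on global image features?
RQ3: Can the model's attention mechanism be visualized to reveal logical reasoning steps based on spatial relations?
RQ4: How does the two-hop attention mechanism perform relative to one-hop and three-hop variants?
RQ5: Can synthetic questions that require spatial reasoning effectively probe and validate the model's reasoning ability?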

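Key Findings

The two-hop SMem-VQA model reaches 58.24% accuracy on the VQA test-standard split, 2.35 percentage points above the iBOWIMG baseline (55.89%).
On the DAQUAR dataset, the two-hop SMem-VQA model reaches 79.05% accuracy, ahead of the iBOWIMG baseline (76.55%).
The model is more accurate across answer categories, particularly in categories that involve complex spatial reasoning, which suggests it generalizes better on spatial questions.
Visualizations of the attention weights confirm that the model aligns specific question words (such as "cat" and "basket") with the corresponding image regions, making its reasoning interpretable.
Adding a second hop improves on the one-hop version (56.56% on VQA test-standard), which indicates that multi-hop inference strengthens spatial reasoning.
A third hop does not improve performance, which suggests that two hops are sufficient for effective spatial attention in this setting.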
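The headline gain is a difference between two accuracies, so it is an absolute gain in percentage points rather than a relative improvement. A small sketch makes the distinction concrete, using only the VQA test-standard figures quoted above; the helper name `gain` is hypothetical.

```python
def gain(new_acc, base_acc):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new_acc - base_acc
    relative = absolute / base_acc * 100
    return round(absolute, 2), round(relative, 2)

# Two-hop SMem-VQA vs. the iBOWIMG baseline, accuracies in percent.
abs_pp, rel_pct = gain(58.24, 55.89)
print(f"absolute: {abs_pp} points, relative: {rel_pct}%")
```

The absolute figure is the 2.35 reported in the findings; the relative figure is larger because it is measured against the baseline's accuracy.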


This review was generated by AI and checked by a human editor.