QUICK REVIEW

[Paper Review] A Focused Dynamic Attention Model for Visual Question Answering

Ilija Ilievski, Shuicheng Yan|arXiv (Cornell University)|Apr 6, 2016

Multimodal Machine Learning Applications|References: 24|Cited by: 130

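One-Sentence Summary

FDA uses question-guided focused dynamic attention, combining local and global visual features with an LSTM-encoded question, to achieve state-of-the-art results on the open-ended and multiple-choice VQA benchmarks.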

ABSTRACT

Visual Question and Answering (VQA) problems are attracting increasing interest from multiple research disciplines. Solving VQA problems requires techniques from both computer vision for understanding the visual contents of a presented image or video, as well as the ones from natural language processing for understanding semantics of the question and generating the answers. Regarding visual content modeling, most of existing VQA methods adopt the strategy of extracting global features from the image or video, which inevitably fails in capturing fine-grained information such as spatial configuration of multiple objects. Extracting features from auto-generated regions -- as some region-based image recognition methods do -- cannot essentially address this problem and may introduce some overwhelming irrelevant features with the question. In this work, we propose a novel Focused Dynamic Attention (FDA) model to provide better aligned image content representation with proposed questions. Being aware of the key words in the question, FDA employs off-the-shelf object detector to identify important regions and fuse the information from the regions and global features via an LSTM unit. Such question-driven representations are then combined with question representation and fed into a reasoning unit for generating the answers. Extensive evaluation on a large-scale benchmark dataset, VQA, clearly demonstrate the superior performance of FDA over well-established baselines.

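Research Motivation and Goals

Improve visual content modeling for VQA beyond global image features.
Develop a question-driven attention mechanism that focuses on the relevant image regions.
Fuse the focused region features with the global image context and the question representation.
Demonstrate performance gains over baselines and earlier attention models on a large-scale VQA benchmark.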

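Proposed Method

Extract global and region-based CNN features from the image.
Use an object detector to identify candidate regions relevant to the question.
Feed the image regions and the whole-image context into an LSTM, encoding the visual information in the order of the question words.
Encode the question with an LSTM to obtain the question representation.
Apply the focused dynamic attention mechanism, which orders the region features by the sequence of question words and combines them with the global feature.
Fuse the question and visual representations through tanh and ReLU activations, followed by element-wise multiplication and a feed-forward network; predict the answer with a softmax over the 1,000 most frequent answers.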
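
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the weights are random and untrained, exact string matching stands in for however the model matches question words to detector labels, the global feature is placed last in the visual sequence, and tanh is applied to the question branch and ReLU to the visual branch (the summary does not say which activation goes where). All function names, dimensions, and the toy question are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def init_lstm(rng, d_in, d_h):
    """Random, untrained weights for a single-layer LSTM (illustration only)."""
    return (rng.normal(0, 0.1, (d_in, 4 * d_h)),
            rng.normal(0, 0.1, (d_h, 4 * d_h)),
            np.zeros(4 * d_h))

def lstm_encode(seq, params):
    """Run the LSTM over a sequence of vectors; return the final hidden state."""
    W, U, b = params
    d_h = U.shape[0]
    h, c = np.zeros(d_h), np.zeros(d_h)
    for x in seq:
        i, f, o, g = np.split(x @ W + h @ U + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def order_regions(question_words, detections):
    """Keep detections whose label occurs in the question, in question-word order.
    Assumption: exact string match stands in for the word-to-label matching."""
    ordered, seen = [], set()
    for word in question_words:
        if word in seen:
            continue
        seen.add(word)
        ordered.extend((label, feat) for label, feat in detections if label == word)
    return ordered

def fuse_and_predict(q_vec, v_vec, p):
    """tanh / ReLU projections, element-wise product, feed-forward layer, softmax."""
    q = np.tanh(q_vec @ p["Wq"])              # question branch (assumed tanh)
    v = np.maximum(0.0, v_vec @ p["Wv"])      # visual branch (assumed ReLU)
    hidden = np.maximum(0.0, (q * v) @ p["W1"])
    return softmax(hidden @ p["W2"])          # distribution over candidate answers

def demo(seed=0):
    rng = np.random.default_rng(seed)
    d_feat, d_word, d_h, d_joint, n_answers = 16, 8, 12, 10, 1000
    question = ["what", "color", "is", "the", "dog", "near", "the", "bench"]
    embed = {w: rng.normal(size=d_word) for w in dict.fromkeys(question)}
    # Hypothetical detector output: (label, region CNN feature) pairs.
    detections = [(lab, rng.normal(size=d_feat)) for lab in ("bench", "dog", "car")]
    global_feat = rng.normal(size=d_feat)

    regions = order_regions(question, detections)
    visual_seq = [feat for _, feat in regions] + [global_feat]  # global last (assumed)
    v_vec = lstm_encode(visual_seq, init_lstm(rng, d_feat, d_h))
    q_vec = lstm_encode([embed[w] for w in question], init_lstm(rng, d_word, d_h))
    p = {"Wq": rng.normal(0, 0.1, (d_h, d_joint)),
         "Wv": rng.normal(0, 0.1, (d_h, d_joint)),
         "W1": rng.normal(0, 0.1, (d_joint, d_joint)),
         "W2": rng.normal(0, 0.1, (d_joint, n_answers))}
    return regions, fuse_and_predict(q_vec, v_vec, p)

if __name__ == "__main__":
    regions, probs = demo()
    print([label for label, _ in regions], probs.shape)
```

In this toy run the "car" detection is dropped because no question word mentions it, and the "dog" region precedes the "bench" region because that is the order in which the words appear in the question. With untrained weights the predicted distribution is meaningless; only the data flow is of interest.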

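Experimental Results

Research Questions

RQ1: Does question-driven focus on object-level image regions improve VQA accuracy over global or unfocused attention approaches?
RQ2: How does combining local region features with global context affect the open-ended and multiple-choice VQA tasks?
RQ3: Can LSTM-based fusion of the question with the focused visual features reach state-of-the-art results on VQA benchmarks?

Key Findings

FDA achieves state-of-the-art performance on both the open-ended and multiple-choice tasks of the VQA dataset.
Open-ended, test-dev: FDA 59.24 (All), 81.14 (Y/N), 45.77 (Other), 36.16 (Num); test-std: 59.54 (All).
Multiple-choice, test-dev: FDA 64.01 (All), 81.50 (Y/N), 54.72 (Other), 39.00 (Num); test-std: 64.18 (All).
FDA leads the SAN baseline by about 0.6% on the open-ended task and by about 1.1% on the multiple-choice task.
Qualitative results show that, once the model focuses on the relevant regions, accuracy improves on color, counting, and object-recognition questions.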


This review was generated by AI and checked by a human editor.