QUICK REVIEW

[Paper Review] Graph Reasoning Networks for Visual Question Answering.

Dalu Guo, Chang Xu|arXiv (Cornell University)|Jul 23, 2019

Multimodal Machine Learning Applications|References: 1|Cited by: 9

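One-Sentence Summary

This paper proposes Graph Reasoning Networks (GRN) for visual question answering, modeling relationships with two kinds of graphs: an inter-graph that aligns visual objects with question words, and an intra-graph that reasons over the relationships between objects. The method reaches 57.04% accuracy on GQA v1.1, which the authors report as state of the art, and yields a notable improvement on counting questions in VQA 2.0.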

ABSTRACT

The interaction between language and visual information has been emphasized in visual question answering (VQA) with the help of attention mechanism. However, the relationship between words in question has been underestimated, which makes it hard to answer questions that involve the relationship between multiple entities, such as comparison and counting. In this paper, we develop the graph reasoning networks to tackle this problem. Two kinds of graphs are investigated, namely inter-graph and intra-graph. The inter-graph transfers features of the detected objects to their related query words, enabling the output nodes to have both semantic and factual information. The intra-graph exchanges information between these output nodes from inter-graph to amplify implicit yet important relationship between objects. These two kinds of graphs cooperate with each other, and thus our resulting model can reason the relationship and dependence between objects, which leads to realization of multi-step reasoning. Experimental results on the GQA v1.1 dataset demonstrate the reasoning ability of our method to handle compositional questions about real-world images. We achieve state-of-the-art performance, boosting accuracy to 57.04%. On the VQA 2.0 dataset, we also receive a promising improvement on overall accuracy, especially on counting problem.

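Research Motivation and Objectives

Address the shortcoming that existing VQA models make insufficient use of the linguistic relationships between words in the question.
Improve reasoning over multiple visual entities, especially for compositional questions involving comparison, counting, or dependency.
Develop a graph-based architecture that captures the semantic and factual relationships between visual objects and question terms.
Fuse visual and language features through structured message passing in two complementary graphs, enabling multi-step reasoning.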

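Proposed Method

The inter-graph attends from question words to the detected visual objects and transfers visual features to the question-related nodes, producing semantically grounded representations.
The intra-graph performs message passing among the nodes produced by the inter-graph, enabling reasoning over the relationships between visual entities.
Both the inter-graph and the intra-graph use graph neural networks, iteratively refining node representations through neighborhood aggregation.
The two graphs are stacked and trained jointly, so the model can perform multi-hop reasoning over visual and language inputs.
Attention is applied in both graphs to dynamically weight relevant features and relationships.
A final prediction head aggregates the refined node representations to produce the answer to the question.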
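
The steps above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the dot-product attention, the residual addition in the inter-graph, the averaging update in the intra-graph, the mean-pooling prediction head, and every function name here are assumptions chosen for brevity, and no learned weights are included.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    """Combine vectors using one scalar weight per vector."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def inter_graph(words, objects):
    """Each question word attends over the detected objects and absorbs
    their features (assumed: dot-product scores, residual addition)."""
    nodes = []
    for word in words:
        attn = softmax([dot(word, obj) for obj in objects])
        visual = weighted_sum(attn, objects)
        nodes.append([w + v for w, v in zip(word, visual)])
    return nodes

def intra_graph(nodes, steps=2):
    """Fully connected message passing among the inter-graph output nodes
    (assumed: attention-weighted messages averaged with the node state)."""
    for _ in range(steps):
        updated = []
        for node in nodes:
            attn = softmax([dot(node, other) for other in nodes])
            message = weighted_sum(attn, nodes)
            updated.append([0.5 * (n + m) for n, m in zip(node, message)])
        nodes = updated
    return nodes

def predict(nodes, answer_embeddings):
    """Mean-pool the refined nodes and return the index of the
    best-matching answer embedding (a hypothetical prediction head)."""
    dim = len(nodes[0])
    pooled = [sum(n[d] for n in nodes) / len(nodes) for d in range(dim)]
    scores = [dot(pooled, ans) for ans in answer_embeddings]
    return scores.index(max(scores))

# Toy example: 2 question words, 3 detected objects, 2 candidate answers.
words = [[1.0, 0.0], [0.0, 1.0]]
objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
answers = [[1.0, 0.0], [0.0, 1.0]]
refined = intra_graph(inter_graph(words, objects))
answer_index = predict(refined, answers)
```

Stacking `inter_graph` and `intra_graph` in this way mirrors the described ordering: visual features first flow to the question-word nodes, and only then do those nodes exchange information with each other. In a trained model the attention scores and node updates would use learned projections rather than raw dot products.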

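Experimental Results

Research Questions

RQ1: Does modeling the linguistic relationships between words in the question improve reasoning in visual question answering?
RQ2: How can the relationships between visual objects be effectively captured and reasoned over within a neural network architecture?
RQ3: Does the dual-graph structure (inter-graph and intra-graph) strengthen multi-step reasoning more than a single attention mechanism does?
RQ4: How much does the proposed method improve performance on compositional questions, especially counting and comparison queries?
RQ5: Can the model generalize to real-world images with complex visual relationships?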

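Key Findings

The proposed Graph Reasoning Networks reach a state-of-the-art accuracy of 57.04% on the GQA v1.1 dataset.
The model shows a notable improvement on counting-related questions in the VQA 2.0 benchmark, indicating stronger compositional reasoning.
The dual-graph mechanism effectively captures implicit relationships between visual objects that are not directly mentioned in the question.
The intra-graph component plays a key role in amplifying subtle dependencies between objects, supporting multi-step reasoning.
The model outperforms existing attention-based VQA models on complex compositional questions involving comparison and counting.
Ablation experiments confirm that both the inter-graph and the intra-graph components contribute significantly to final performance.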


This review was generated by AI and reviewed by human editors.