QUICK REVIEW

[Paper Review] Scene Graph Reasoning with Prior Visual Relationship for Visual Question Answering

Zhuoqian Yang, Zengchang Qin|arXiv (Cornell University)|Dec 23, 2018

Multimodal Machine Learning Applications|References: 57|Cited by: 26

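One-Sentence Summary

This paper proposes SceneGCN, a scene-graph-based visual question answering model that strengthens relational reasoning with prior visual relationship representations. By encoding objects and relationships into a deep semantic space and applying a graph convolutional network with question-guided attention, the model reaches a state-of-the-art 54.56% accuracy on the GQA benchmark and improves on earlier methods in both reasoning ability and interpretability.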

ABSTRACT

One of the key issues of Visual Question Answering (VQA) is to reason with semantic clues in the visual content under the guidance of the question; how to model relational semantics remains a great challenge. To fully capture visual semantics, we propose to reason over a structured visual representation - scene graph, with embedded objects and inter-object relationships. This shows great benefit over vanilla vector representations and implicit visual relationship learning. Based on existing visual relationship models, we propose a visual relationship encoder that projects visual relationships into a learned deep semantic space constrained by visual context and language priors. Upon the constructed graph, we propose a Scene Graph Convolutional Network (SceneGCN) to jointly reason the object properties and relational semantics for the correct answer. We demonstrate the model's effectiveness and interpretability on the challenging GQA dataset and the classical VQA 2.0 dataset, remarkably achieving state-of-the-art 54.56% accuracy on GQA compared to the existing best model.

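Motivation and Objectives

Improve VQA performance by explicitly modeling visual relationships beyond individual objects.
Address the limitation that relational reasoning in existing VQA models is implicit or only weakly supervised.
Integrate prior knowledge from pretrained visual relationship detection models into a structured scene graph to strengthen reasoning.
Develop a differentiable, interpretable reasoning mechanism that progressively identifies the relevant relationships and objects.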

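Proposed Method

The model builds a scene graph with a pretrained object detector and a visual relationship encoder, producing relationship embeddings constrained by visual context and language priors.
The Scene Graph Convolutional Network (SceneGCN) performs message passing over the scene graph, jointly updating node representations from object features and relationship features.
Each scene graph convolution unit uses a question- and relation-guided self-attention mechanism to dynamically weight each relationship by its relevance to the question.
A question-guided object attention module attends over the relation-aware representations to pick out the most relevant objects, enabling progressive reasoning.
The visual relationship encoder is trained with both visual context and language priors, yielding type-aware, discriminative relationship embeddings.
The whole model is end-to-end trainable, and its attention mechanisms provide interpretability through localized reasoning traces.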
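
The message-passing step described above can be sketched roughly as follows. This is an illustrative sketch in plain Python under stated assumptions, not the authors' implementation: the function names (`dot`, `softmax`, `scene_gcn_layer`), the dot-product scoring of relations against the question, the additive fusion of neighbor and relation features, and the residual update are all hypothetical simplifications, and the learned projections a real model would use are omitted. The toy graph (`plate`, `beef`, `table`) is made up for illustration.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scene_gcn_layer(nodes, edges, question):
    """One round of question-guided message passing (assumed form).

    nodes:    {node_id: feature vector}
    edges:    list of (src, dst, relation feature vector)
    question: question embedding with the same dimension as the features
    """
    updated = {}
    for node_id, feat in nodes.items():
        incoming = [(src, rel) for src, dst, rel in edges if dst == node_id]
        if not incoming:
            # No incoming relations: the node keeps its features.
            updated[node_id] = list(feat)
            continue
        # Weight each incoming relation by its similarity to the question.
        weights = softmax([dot(rel, question) for _, rel in incoming])
        # Message = weighted sum of (neighbor + relation) features.
        message = [0.0] * len(feat)
        for w, (src, rel) in zip(weights, incoming):
            for i in range(len(feat)):
                message[i] += w * (nodes[src][i] + rel[i])
        # Residual update (an assumption, chosen for simplicity).
        updated[node_id] = [f + m for f, m in zip(feat, message)]
    return updated

# Toy scene graph: two relations point at "plate".
nodes = {"plate": [1.0, 0.0], "beef": [0.0, 1.0], "table": [0.5, 0.5]}
edges = [("beef", "plate", [0.0, 2.0]), ("table", "plate", [2.0, 0.0])]
question = [0.0, 5.0]  # aligned with the beef -> plate relation
out = scene_gcn_layer(nodes, edges, question)
```

In this toy setup the question embedding aligns with the `beef -> plate` relation, so that edge receives nearly all of the attention weight and dominates the update of `plate`, while nodes with no incoming edges are left unchanged.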

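Experimental Results

Research Questions

RQ1: Do prior visual relationship representations improve reasoning performance in visual question answering?
RQ2: How can visual relationships be effectively encoded and integrated into a neural network for VQA?
RQ3: Can a graph-based architecture combined with attention achieve progressive, interpretable reasoning over a scene graph?
RQ4: Does integrating structured relational knowledge improve generalization and accuracy on compositional VQA benchmarks?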

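Key Findings

The proposed SceneGCN model achieves state-of-the-art performance on the challenging GQA dataset, with a top-1 accuracy of 54.56%.
Ablation experiments show that introducing prior visual relationship representations markedly improves reasoning performance over models that lack such priors.
The model is highly interpretable: its attention maps clearly localize the relationships and objects relevant to the question.
Qualitative results show that the model reasons progressively, first identifying the key relationships and then focusing on the objects that matter most for predicting the answer.
The question-guided object attention mechanism highlights the most relevant objects, for example identifying 'beef' in a relation-based reasoning chain.
The visual relationship encoder produces discriminative, type-aware embeddings that markedly improve downstream reasoning performance, a conclusion supported by attention visualizations and ablation experiments.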
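
The final step of the progressive reasoning described in these findings, attending over objects and reading off the most relevant one, might look roughly like the sketch below. It is a simplified illustration, not the paper's module: the function name `object_attention`, the dot-product scoring, and the toy feature values are assumptions, and the object names other than 'beef' are invented for the example.

```python
import math

def object_attention(node_feats, question):
    """Question-guided attention over relation-aware object features.

    Returns per-object attention weights and the attention-pooled vector.
    Dot-product scoring is an assumption made for this sketch.
    """
    names = list(node_feats)
    scores = [sum(f * q for f, q in zip(node_feats[n], question)) for n in names]
    # Numerically stable softmax over the object scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = {n: e / total for n, e in zip(names, exps)}
    # Weighted sum of object features, which an answer classifier could consume.
    dim = len(question)
    pooled = [sum(weights[n] * node_feats[n][i] for n in names) for i in range(dim)]
    return weights, pooled

# Toy relation-aware features; values are made up for illustration.
feats = {"plate": [1.0, 0.2], "beef": [0.1, 2.0], "table": [0.8, 0.1]}
question = [0.0, 3.0]
weights, pooled = object_attention(feats, question)
top_object = max(weights, key=weights.get)  # the object with the highest weight
```

Inspecting `weights` is what makes this kind of model interpretable: the distribution shows which object the question steered the model toward, here the hypothetical `beef` node.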


This review was generated by AI and checked by a human editor.