QUICK REVIEW

[Paper Review] An Empirical Study on Leveraging Scene Graphs for Visual Question Answering

Cheng Zhang, Wei‐Lun Chao|arXiv (Cornell University)|Jul 28, 2019

Multimodal Machine Learning Applications | References: 78 | Cited by: 31

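ONE-SENTENCE SUMMARY

This paper studies visual question answering (VQA) over scene graphs (structured representations of the objects in an image and the relationships between them) encoded with graph networks (GNs). The results indicate that GNs can perform structured reasoning over scene graphs and have the potential to outperform state-of-the-art VQA algorithms with a much cleaner architecture, while the features they produce highlight the nodes and edges relevant to each question, making the reasoning process interpretable.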

ABSTRACT

Visual question answering (Visual QA) has attracted significant attention these years. While a variety of algorithms have been proposed, most of them are built upon different combinations of image and language features as well as multi-modal attention and fusion. In this paper, we investigate an alternative approach inspired by conventional QA systems that operate on knowledge graphs. Specifically, we investigate the use of scene graphs derived from images for Visual QA: an image is abstractly represented by a graph with nodes corresponding to object entities and edges to object relationships. We adapt the recently proposed graph network (GN) to encode the scene graph and perform structured reasoning according to the input question. Our empirical studies demonstrate that scene graphs can already capture essential information of images and graph networks have the potential to outperform state-of-the-art Visual QA algorithms but with a much cleaner architecture. By analyzing the features generated by GNs we can further interpret the reasoning process, suggesting a promising direction towards explainable Visual QA.

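MOTIVATION AND GOALS

Explore whether structured scene graphs can improve VQA performance beyond what end-to-end neural networks achieve.
Evaluate how effectively graph networks (GNs) perform structured reasoning over scene graphs for VQA.
Analyze how scene graph quality and node/edge features affect VQA performance.
Enable interpretable reasoning in VQA by visualizing attention over graph components.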
Compare human-annotated scene graphs (Visual Genome) with machine-generated ones (Neural Motifs) on the VQA task.

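PROPOSED METHOD

The authors represent each image as a scene graph in which nodes denote objects and edges denote relationships between objects.
They encode the scene graph with a graph network (GN), which reasons over nodes and edges through message passing.
The model encodes the question with an LSTM and fuses the question encoding with the graph encoding to predict the answer.
They experiment with several input combinations: image features (i), the question (q), candidate answers (c), and the scene graph (S).
To analyze attention, they track the ℓ₂ norms of the node and edge updates and visualize the parts of the graph most relevant to the question.
They compare several scene graph sources, namely Visual Genome (VG), Neural Motifs (NM), and no graph (NG), and evaluate the effect of node names and attributes.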
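
The message-passing and ℓ₂-norm steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy graph, the feature size `DIM`, the helper names (`make_layer`, `gn_step`, `l2_norm`), and the untrained random layers standing in for learned update functions are all assumptions made for illustration, and the scores it prints carry no meaning beyond showing the mechanics. It also ranks nodes and edges by the norm of their updated features, a simplification of tracking the updates themselves.

```python
import math
import random

DIM = 4  # toy feature size shared by nodes, edges, and the question (assumption)


def make_layer(in_dim, out_dim, rng):
    """Untrained random linear map + ReLU, standing in for a learned update function."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]

    def apply(x):
        return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

    return apply


def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))


def gn_step(nodes, edges, question, phi_e, phi_v):
    """One message-passing step over a scene graph.

    nodes: list of feature vectors, one per object
    edges: list of (sender_index, receiver_index, feature_vector)
    question: feature vector that conditions every update
    """
    # Edge update: each relationship sees its own feature, both endpoints, and the question.
    new_edges = [(s, r, phi_e(feat + nodes[s] + nodes[r] + question)) for s, r, feat in edges]

    # Node update: sum the updated incoming edges, then combine with the node and the question.
    new_nodes = []
    for i, feat in enumerate(nodes):
        incoming = [e for _, r, e in new_edges if r == i]
        agg = [sum(col) for col in zip(*incoming)] if incoming else [0.0] * DIM
        new_nodes.append(phi_v(agg + feat + question))
    return new_nodes, new_edges


rng = random.Random(0)
rand_vec = lambda: [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

# Hypothetical scene graph: objects as nodes, (subject, predicate, object) triples as edges.
node_names = ["man", "kite", "sky"]
edge_names = [("man", "holding", "kite"), ("kite", "in", "sky")]
nodes = [rand_vec() for _ in node_names]
edges = [(node_names.index(s), node_names.index(o), rand_vec()) for s, _, o in edge_names]
question = rand_vec()  # placeholder for an LSTM question encoding

phi_e = make_layer(4 * DIM, DIM, rng)  # input: edge + sender + receiver + question
phi_v = make_layer(3 * DIM, DIM, rng)  # input: aggregated edges + node + question
new_nodes, new_edges = gn_step(nodes, edges, question, phi_e, phi_v)

# Rank graph components by the l2 norm of their updated features.
node_scores = {name: l2_norm(v) for name, v in zip(node_names, new_nodes)}
edge_scores = {f"{s} -{p}-> {o}": l2_norm(e[2]) for (s, p, o), e in zip(edge_names, new_edges)}
for name, score in sorted(node_scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

In a trained model the update functions would be learned jointly with the question encoder, so a large norm would plausibly indicate a node or edge the model relies on for the given question; with random weights, as here, the ranking is arbitrary.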

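EXPERIMENTAL RESULTS

Research Questions

RQ1: Do scene graphs extracted from images improve VQA performance over standard deep learning models?
RQ2: How does the quality of automatically generated scene graphs affect VQA accuracy?
RQ3: Can graph networks perform structured reasoning over scene graphs in a way that makes VQA more interpretable?
RQ4: Which question types (e.g., "what", "how many", "where") benefit most from scene graph reasoning?
RQ5: Does adding node attributes or relationships improve reasoning on specific question types?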

Key Findings

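The VG(N, A) model, which uses Visual Genome graphs with node names and attributes, reaches 62.6% overall accuracy on the VQA benchmark, 19.3 percentage points above the no-graph baseline (43.3%).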
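For "what", "who", and "how many" questions, adding node names to the scene graph raises accuracy by 10–20 percentage points over the no-graph baseline.
"Color" questions see the largest relative gain once node attributes are added, with accuracy rising by more than 10%.
On "why" questions, VG(N, A) reaches 85.3% accuracy, the highest of all configurations.
Qualitative analysis shows that the GN-based model implicitly attends to relevant nodes and edges (e.g., "kite", "holding", "game"), revealing interpretable reasoning paths.
Failure cases with Neural Motifs graphs are attributed to missing node attributes, underscoring the importance of rich graph features.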

This review was generated by AI and reviewed by human editors.