QUICK REVIEW

[Paper Review] Dynamic Graph Attention for Referring Expression Comprehension

Sibei Yang, Guanbin Li | arXiv (Cornell University) | Sep 18, 2019

Multimodal Machine Learning Applications | References: 31 | Cited by: 24

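One-Sentence Summary

This paper proposes Dynamic Graph Attention (DGA), a new method for referring expression comprehension that performs multi-step, language-guided visual reasoning over a dynamic graph of image objects and their relationships. By modeling linguistic structure with a differential analyzer and updating compound object representations through graph propagation, DGA achieves state-of-the-art performance on three benchmark datasets while producing interpretable, step-by-step reasoning paths for complex expressions.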

ABSTRACT

Referring expression comprehension aims to locate the object instance described by a natural language referring expression in an image. This task is compositional and inherently requires visual reasoning on top of the relationships among the objects in the image. Meanwhile, the visual reasoning process is guided by the linguistic structure of the referring expression. However, existing approaches treat the objects in isolation or only explore the first-order relationships between objects without being aligned with the potential complexity of the expression. Thus it is hard for them to adapt to the grounding of complex referring expressions. In this paper, we explore the problem of referring expression comprehension from the perspective of language-driven visual reasoning, and propose a dynamic graph attention network to perform multi-step reasoning by modeling both the relationships among the objects in the image and the linguistic structure of the expression. In particular, we construct a graph for the image with the nodes and edges corresponding to the objects and their relationships respectively, propose a differential analyzer to predict a language-guided visual reasoning process, and perform stepwise reasoning on top of the graph to update the compound object representation at every node. Experimental results demonstrate that the proposed method can not only significantly surpass all existing state-of-the-art algorithms across three common benchmark datasets, but also generate interpretable visual evidences for stepwisely locating the objects referred to in complex language descriptions.

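Motivation and Objectives

Address the limitation that existing referring expression comprehension models lack explicit multi-step reasoning and interpretability.
Improve grounding performance on complex referring expressions by integrating linguistic structure with the visual relationships among objects.
Enable high-level, compositional reasoning by modeling linguistic syntax and visual graph structure in a unified framework.
Generate interpretable, step-by-step visual evidence for the reasoning behind object grounding.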

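Proposed Method

Construct a directed visual graph in which nodes represent detected objects and edges represent the relationships between them.
Introduce a differential analyzer that parses the referring expression step by step into constituent expressions, capturing its linguistic structure.
Perform iterative, language-guided reasoning on the graph, updating the compound object representation at every node through dynamic graph attention.
At each reasoning step, apply soft attention over words, nodes, and relationships to highlight the relevant linguistic and visual components.
Learn joint representations through end-to-end training with a matching loss that aligns the expression with the final object representations.
Use a multi-step reasoning mechanism that propagates attention over the graph under language guidance, supporting higher-order reasoning.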
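
The steps listed above can be illustrated with a small, standard-library Python sketch. It is a toy under stated assumptions, not the authors' architecture: the fixed `step_queries` stand in for the learned analyzer, plain dot products stand in for learned attention scoring, the additive update stands in for the paper's graph propagation, and the cross-entropy form of `matching_loss` is an assumption (the summary only says "matching loss"). All function and parameter names here are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Sum of vectors, each scaled by its weight."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def reason(word_vecs, node_vecs, edges, step_queries):
    """Run one language-guided propagation pass per reasoning step.

    word_vecs:    one embedding per word of the expression
    node_vecs:    one feature vector per detected object (graph node)
    edges:        directed (src, dst) pairs; dst gathers information from src
    step_queries: one query vector per step (assumed stand-in for the analyzer)
    """
    states = [list(v) for v in node_vecs]
    trace = []
    for query in step_queries:
        # Soft attention over words selects this step's constituent expression.
        word_attn = softmax([dot(query, w) for w in word_vecs])
        phrase = weighted_sum(word_attn, word_vecs)
        # Soft attention over nodes: how relevant each object is to the phrase.
        node_attn = softmax([dot(phrase, s) for s in states])
        new_states = []
        for i, state in enumerate(states):
            sources = [src for src, dst in edges if dst == i]
            if sources:
                # Soft attention over incoming edges, then gather a message.
                edge_attn = softmax([dot(phrase, states[s]) for s in sources])
                message = weighted_sum(edge_attn, [states[s] for s in sources])
            else:
                message = [0.0] * len(state)
            # Additive update of the compound representation at node i.
            new_states.append([x + node_attn[i] * m for x, m in zip(state, message)])
        states = new_states
        trace.append((word_attn, node_attn))
    return states, trace


def match(expression_vec, states):
    """Score every node against the expression; return best index and probabilities."""
    scores = [dot(expression_vec, s) for s in states]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, softmax(scores)


def matching_loss(expression_vec, states, target):
    """Assumed cross-entropy form: negative log-probability of the target node."""
    _, probs = match(expression_vec, states)
    return -math.log(probs[target])


if __name__ == "__main__":
    words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    nodes = [[0.5, 0.1], [0.2, 0.9], [0.3, 0.3]]
    edges = [(0, 1), (2, 1), (1, 0)]
    states, trace = reason(words, nodes, edges, step_queries=[[1.0, 0.0], [0.0, 1.0]])
    print("predicted node:", match([0.0, 1.0], states)[0])
    for step, (word_attn, node_attn) in enumerate(trace, start=1):
        print("step", step, "word attention:", word_attn, "node attention:", node_attn)
```

The `trace` returned by `reason` keeps the per-step word and node attention weights, which is the kind of intermediate signal the review describes as interpretable, step-by-step evidence. In a trained model the queries and scoring functions would be learned end to end rather than fixed as they are here.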

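Experimental Results

Research Questions

RQ1: Can the model perform multi-step visual reasoning guided by the linguistic structure of complex referring expressions?
RQ2: How does integrating object relationships into a dynamic graph improve grounding accuracy on complex expressions?
RQ3: Can the reasoning process be made interpretable by visualizing the attention over words, nodes, and relationships at each step?
RQ4: Does learning linguistic structure parsing end to end improve performance compared with fixed or heuristic parsing?
RQ5: What number of reasoning steps gives the most effective and robust grounding in referring expression comprehension?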

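Key Findings

DGA achieves state-of-the-art performance on all three benchmark datasets. On RefCOCO, it reaches 86.34% on val, 86.64% on testA, and 84.79% on testB.
On RefCOCO+, DGA reaches 73.56% on val, 78.31% on testA, and 68.15% on testB, outperforming all baselines.
On RefCOCOg, DGA reaches 80.21% on val and 80.26% on test, setting a new state of the art.
Ablation experiments show that DGA with three reasoning steps, DGA(3), performs best, which suggests that a fourth step introduces noise.
The variant that uses a language parser (DGA*) performs worse than the full DGA, indicating that learning the linguistic structure end to end is important.
Qualitative results show that DGA produces interpretable attention maps over words, nodes, and relationships, visualizing the reasoning chain step by step.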


This review was generated by AI and checked by human editors.