QUICK REVIEW

[Paper Review] Using Syntax to Ground Referring Expressions in Natural Images

Volkan Cirik, Taylor Berg-Kirkpatrick|arXiv (Cornell University)|May 26, 2018

Multimodal Machine Learning Applications | Cited by 35

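One-Sentence Summary

GroundNet is a syntax-based neural network that uses the parse tree of a referring expression to build a dynamic computation graph for grounding that expression in an image; by mapping syntactic constituents to neural modules, it localizes both the target object and the supporting objects, reaches state-of-the-art accuracy on supporting-object localization while keeping target-localization accuracy high, and is more interpretable as a result.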

ABSTRACT

We introduce GroundNet, a neural network for referring expression recognition -- the task of localizing (or grounding) in an image the object referred to by a natural language expression. Our approach to this task is the first to rely on a syntactic analysis of the input referring expression in order to inform the structure of the computation graph. Given a parse tree for an input expression, we explicitly map the syntactic constituents and relationships present in the tree to a composed graph of neural modules that defines our architecture for performing localization. This syntax-based approach aids localization of both the target object and auxiliary supporting objects mentioned in the expression. As a result, GroundNet is more interpretable than previous methods: we can (1) determine which phrase of the referring expression points to which object in the image and (2) track how the localization of the target object is determined by the network. We study this property empirically by introducing a new set of annotations on the GoogleRef dataset to evaluate localization of supporting objects. Our experiments show that GroundNet achieves state-of-the-art accuracy in identifying supporting objects, while maintaining comparable performance in the localization of target objects.

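Motivation and Objectives

Improve the interpretability of referring expression grounding by exploiting the syntactic structure of natural language expressions.
Address the weakness of prior models at localizing supporting objects, which are essential for disambiguation.
Develop a dynamic neural architecture that reflects the recursive and compositional nature of referring expressions.
Introduce a new set of supporting-object annotations so that intermediate localization decisions can be evaluated.
Show that syntactic compositionality can improve both interpretability and localization performance in vision-language grounding.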

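Proposed Method

The model builds a dynamic computation graph from the syntactic parse tree of the referring expression, mapping each syntactic constituent to a neural module.
Each node of the graph is a neural module that scores objects in the image; the module operations include localization and relational reasoning.
The network evaluates the graph bottom-up, starting from noun phrases and prepositional phrases and composing their outputs up to the full expression.
Syntactic constituents such as noun phrases (NPs) and prepositional phrases (PPs) are explicitly mapped to modules that detect objects and their spatial relations.
The architecture is interpretable: each module's output can be traced back to a specific phrase in the text, showing which object in the image that phrase refers to.
The model is trained end to end using only target-object annotations; no ground-truth bounding boxes for supporting objects are required.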
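
The bottom-up evaluation described above can be illustrated with a toy sketch. This is not the authors' implementation: the module names `locate` and `relate`, the hand-written scoring rules, the box names and coordinates, and the single supported relation "near" are all assumptions made for illustration, standing in for learned neural modules that operate on image features. The sketch only shows the control flow: one module call per parse-tree constituent, a per-phrase trace for interpretability, and a loss computed on the target object alone.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """One constituent of a toy parse tree."""
    label: str                                  # "NP" or "PP"
    text: str                                   # head phrase (NP) or preposition (PP)
    children: List["Node"] = field(default_factory=list)


# Hypothetical candidate boxes, reduced to (x, y) centers for simplicity.
BOXES: Dict[str, Tuple[float, float]] = {
    "plate_left": (1.0, 0.0),
    "plate_right": (5.0, 0.0),
    "cup": (6.0, 0.0),
}


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Normalize raw scores into a distribution over boxes."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}


def locate(text: str) -> Dict[str, float]:
    """Toy localization module: favors boxes whose name matches the head noun."""
    head = text.split()[-1]
    return softmax({n: 4.0 if n.startswith(head) else 0.0 for n in BOXES})


def relate(relation: str, support: Dict[str, float]) -> Dict[str, float]:
    """Toy relational module: favors boxes close to the expected support location."""
    if relation != "near":
        raise ValueError(f"unsupported relation in this sketch: {relation}")
    sx = sum(p * BOXES[n][0] for n, p in support.items())
    sy = sum(p * BOXES[n][1] for n, p in support.items())
    return softmax({n: -math.hypot(x - sx, y - sy) for n, (x, y) in BOXES.items()})


def phrase(node: Node) -> str:
    """Full text span covered by a constituent."""
    return " ".join([node.text] + [phrase(c) for c in node.children])


def evaluate(node: Node, trace: Dict[str, str]) -> Dict[str, float]:
    """Evaluate the tree bottom-up; record the best box for every phrase."""
    child_outputs = [evaluate(c, trace) for c in node.children]
    if node.label == "PP":
        out = relate(node.text, child_outputs[0])
    else:  # NP: locate the head noun, then intersect with modifier outputs
        out = locate(node.text)
        for child in child_outputs:
            out = {n: out[n] * child[n] for n in out}
        total = sum(out.values())
        out = {n: v / total for n, v in out.items()}
    trace[phrase(node)] = max(out, key=out.get)
    return out


# "the plate near the cup" as NP(the plate, PP(near, NP(the cup)))
tree = Node("NP", "the plate", [Node("PP", "near", [Node("NP", "the cup")])])
trace: Dict[str, str] = {}
scores = evaluate(tree, trace)
target = max(scores, key=scores.get)

# Supervision touches only the target: negative log-likelihood of its box.
loss = -math.log(scores["plate_right"])
```

In this toy setting the full expression resolves to `plate_right`, and `trace` maps the supporting phrase "the cup" to the `cup` box even though no supporting-object label enters the loss, which mirrors the training setup described in the list above.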

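Experimental Results

Research Questions

RQ1: Does syntactic compositionality improve the localization of supporting objects in referring expression grounding?
RQ2: Does a syntax-based neural architecture make the model more interpretable by allowing its reasoning to be traced back to linguistic constituents?
RQ3: Does a dynamic computation graph built from the parse tree outperform fixed-structure models on complex, recursive referring expressions?
RQ4: Is there a trade-off between accuracy and interpretability in referring expression models, and can syntactic structure mitigate it?
RQ5: To what extent do current state-of-the-art models fail to localize supporting objects, and can this shortcoming be measured quantitatively?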

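Key Findings

GroundNet achieves state-of-the-art supporting-object localization on the GoogleRef dataset, outperforming prior models.
Although it additionally localizes supporting objects, the model's target-localization accuracy is comparable to that of state-of-the-art methods.
Empirical evaluation on the newly annotated supporting-object locations confirms that prior models localize supporting objects poorly.
The syntax-based computation graph makes the model's reasoning traceable: each module's output can be linked to a specific phrase in the referring expression.
The model localizes recursive expressions that involve supporting objects, such as 'the plate closest to the coffee cup'.
The results suggest that syntactic compositionality is a key factor in improving interpretability and performance in vision-language grounding.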
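
To make the idea of quantitatively measuring supporting-object localization concrete, here is a minimal sketch of an accuracy computation. It assumes that each annotation maps a supporting phrase to the identifier of a candidate box and that a prediction counts as correct when it selects the same box; the summary does not state the exact metric used in the paper. The function name `supporting_accuracy` and all phrases and box identifiers below are invented for illustration and are not results from the paper.

```python
from typing import Dict, List


def supporting_accuracy(
    predictions: List[Dict[str, str]],
    annotations: List[Dict[str, str]],
) -> float:
    """Fraction of annotated supporting phrases whose predicted box matches."""
    correct = 0
    total = 0
    for pred, gold in zip(predictions, annotations):
        for phrase, gold_box in gold.items():
            total += 1
            # A missing prediction for an annotated phrase counts as an error.
            correct += int(pred.get(phrase) == gold_box)
    return correct / total if total else 0.0


# Made-up example: two expressions, three annotated supporting phrases.
annotations = [
    {"the cup": "box_3"},
    {"the table": "box_1", "the window": "box_4"},
]
predictions = [
    {"the cup": "box_3"},
    {"the table": "box_2", "the window": "box_4"},
]
accuracy = supporting_accuracy(predictions, annotations)
```

Here two of the three annotated phrases are matched, so `accuracy` is 2/3. The same per-phrase comparison is what a phrase-to-object trace makes possible for a model whose intermediate outputs are tied to constituents.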


This review was generated by AI and reviewed by a human editor.