QUICK REVIEW

[Paper Review] Real-Time Referring Expression Comprehension by Single-Stage Grounding Network

Xinpeng Chen, Lin Ma | arXiv (Cornell University) | Dec 8, 2018

Multimodal Machine Learning Applications | Cited by 63

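One-Sentence Summary

SSG is an end-to-end, single-stage model that localizes the referent of a referring expression in an image without region proposals. It reaches competitive accuracy at real-time speed, including state-of-the-art results on ReferItGame and about 40 referents per second on RefCOCO on a GPU.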

ABSTRACT

In this paper, we propose a novel end-to-end model, namely Single-Stage Grounding network (SSG), to localize the referent given a referring expression within an image. Different from previous multi-stage models which rely on object proposals or detected regions, our proposed model aims to comprehend a referring expression through one single stage without resorting to region proposals as well as the subsequent region-wise feature extraction. Specifically, a multimodal interactor is proposed to summarize the local region features regarding the referring expression attentively. Subsequently, a grounder is proposed to localize the referring expression within the given image directly. For further improving the localization accuracy, a guided attention mechanism is proposed to enforce the grounder to focus on the central region of the referent. Moreover, by exploiting and predicting visual attribute information, the grounder can further distinguish the referent objects within an image and thereby improve the model performance. Experiments on RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate that our proposed SSG without relying on any region proposals can achieve comparable performance with other advanced models. Furthermore, our SSG outperforms the previous models and achieves the state-of-art performance on the ReferItGame dataset. More importantly, our SSG is time efficient and can ground a referring expression in a 416×416 image from the RefCOCO dataset in 25ms (40 referents per second) on average with a Nvidia Tesla P40, accomplishing more than 9× speedups over the existing multi-stage models.

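Motivation and Objectives

Enable real-time referring expression grounding without region proposals.
Propose an end-to-end Single-Stage Grounding network (SSG) comprising multimodal encoders, an interactor, and a grounder.
Incorporate guided attention and attribute prediction to improve localization accuracy.
Demonstrate efficiency and competitive accuracy on standard datasets (RefCOCO, RefCOCO+, RefCOCOg, ReferItGame).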

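Proposed Method

Encode the image with a YOLO-v3-based backbone to obtain local region features.
Encode the referring expression with a two-layer Bi-LSTM over ELMo embeddings.
Use an attention-based multimodal interactor to produce a joint image-text representation.
Ground the expression by predicting a bounding box and a confidence score directly from the joint representation.
Apply four losses: localization (MSE), confidence (binary cross-entropy), guided attention (center bias), and attribute prediction (multi-label).
Train with a weighted sum of the losses; at inference, only the grounding module is enabled.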
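
The steps above can be illustrated with a toy sketch in plain Python. It is not the authors' implementation: the dot-product attention scoring, the negative-log form of the guided attention loss, the helper names (`attend`, `guided_attention_loss`, `attribute_loss`, `weighted_total`), and the loss weights are all assumptions chosen for illustration, since this summary does not give the exact formulas.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(region_feats, text_feat):
    """Attention-weighted summary of region features given a text feature.
    Assumption: dot-product scoring; the paper's interactor may differ."""
    scores = [sum(r * t for r, t in zip(region, text_feat))
              for region in region_feats]
    weights = softmax(scores)
    dim = len(region_feats[0])
    attended = [sum(w * region[d] for w, region in zip(weights, region_feats))
                for d in range(dim)]
    return weights, attended

def mse(pred, target):
    """Mean squared error, used here for the box localization loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one probability, clamped to avoid log(0)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))

def guided_attention_loss(weights, center_idx):
    """Assumed form: negative log of the attention mass on the region
    that contains the referent's center."""
    return -math.log(max(weights[center_idx], 1e-7))

def attribute_loss(pred_probs, labels):
    """Multi-label loss: binary cross-entropy averaged over attributes."""
    return sum(bce(p, y) for p, y in zip(pred_probs, labels)) / len(labels)

def weighted_total(losses, loss_weights):
    """Weighted sum of the named losses."""
    return sum(loss_weights[name] * value for name, value in losses.items())

# Toy inputs: three regions with 2-d features and a 2-d text feature.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [2.0, 0.0]
attn, attended = attend(regions, text)

losses = {
    "loc": mse([0.4, 0.5, 0.2, 0.3], [0.5, 0.5, 0.2, 0.2]),
    "conf": bce(0.9, 1.0),
    "att": guided_attention_loss(attn, center_idx=0),
    "attr": attribute_loss([0.8, 0.1], [1.0, 0.0]),
}
# Hypothetical weights; the summary does not state the values used.
total = weighted_total(losses, {"loc": 1.0, "conf": 1.0, "att": 0.5, "attr": 0.5})
```

In this sketch the guided attention and attribute terms only affect training. Dropping them at inference leaves the attention and box prediction path unchanged, which matches the note that only the grounding module is enabled at inference.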

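Experimental Results

Research Questions

RQ1: Can an end-to-end single-stage model achieve competitive grounding accuracy without region proposals?
RQ2: Does guided attention toward the referent's center improve localization?
RQ3: Does auxiliary attribute prediction help distinguish referent objects and improve accuracy?
RQ4: Is the single-stage approach computationally efficient enough for real-time grounding on standard datasets?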

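Key Findings

SSG achieves competitive results on RefCOCO, RefCOCO+, and RefCOCOg without region proposals.
SSG achieves state-of-the-art performance on the ReferItGame dataset.
On an Nvidia Tesla P40 GPU, SSG grounds an expression in a 416×416 RefCOCO image in about 25 ms on average (about 40 referents per second).
Ablation studies show that adding the confidence, guided attention, and attribute prediction losses yields improvements.
Inference is substantially faster than multi-stage methods (more than 9× according to the abstract), demonstrating real-time grounding capability.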
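
The speed figures follow from simple arithmetic on the numbers in the abstract (25 ms per expression, more than 9× speedup). The short sketch below checks them; the function names are hypothetical, and the 225 ms bound assumes the speedup is measured against the 25 ms average latency.

```python
def referents_per_second(latency_ms):
    """Throughput implied by a per-expression latency in milliseconds."""
    return 1000.0 / latency_ms

def baseline_latency_lower_bound_ms(latency_ms, speedup):
    """If a model is at least `speedup` times faster than a baseline,
    the baseline takes at least latency_ms * speedup."""
    return latency_ms * speedup

ssg_latency_ms = 25.0  # average latency reported in the abstract
print(referents_per_second(ssg_latency_ms))                  # 40.0
print(baseline_latency_lower_bound_ms(ssg_latency_ms, 9.0))  # 225.0
```

Under that assumption, the multi-stage baselines being compared would take more than roughly 225 ms per expression.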

This review was generated by AI and checked by a human editor.