QUICK REVIEW

[Paper Review] Exploring Visual Relationship for Image Captioning

Ting Yao, Yingwei Pan|arXiv (Cornell University)|Sep 19, 2018

Multimodal Machine Learning Applications | References: 37 | Cited by: 49

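One-Sentence Summary

In brief: The paper introduces GCN-LSTM, whose image encoder is built on Graph Convolutional Networks and exploits the semantic and spatial relationships between detected objects to improve image captioning; it achieves state-of-the-art CIDEr-D performance on COCO.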

ABSTRACT

It is always well believed that modeling relationships between objects would be helpful for representing and eventually describing an image. Nevertheless, there has not been evidence in support of the idea on image description generation. In this paper, we introduce a new design to explore the connections between objects for image captioning under the umbrella of attention-based encoder-decoder framework. Specifically, we present Graph Convolutional Networks plus Long Short-Term Memory (dubbed as GCN-LSTM) architecture that novelly integrates both semantic and spatial object relationships into image encoder. Technically, we build graphs over the detected objects in an image based on their spatial and semantic connections. The representations of each region proposed on objects are then refined by leveraging graph structure through GCN. With the learnt region-level features, our GCN-LSTM capitalizes on LSTM-based captioning framework with attention mechanism for sentence generation. Extensive experiments are conducted on COCO image captioning dataset, and superior results are reported when comparing to state-of-the-art approaches. More remarkably, GCN-LSTM increases CIDEr-D performance from 120.1% to 128.7% on COCO testing set.

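Motivation and Objectives

Exploit object relationships in captioning for richer image understanding.
Propose a relation-aware image encoder that integrates semantic and spatial graphs.
Demonstrate improved captioning performance on COCO through graph-based attention decoding.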

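Proposed Method

Detect objects with Faster R-CNN to form the region set V.
Build a semantic graph and a spatial graph over the detected regions, each with directed, labeled edges.
Refine the region features with a labeled, directed GCN that uses edge-wise gates.
Generate captions with two attention-based LSTM decoders, one per graph.
Fuse the outputs by late fusion: a linear combination of the word probabilities from the two decoders.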
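
The refinement step can be illustrated with a minimal NumPy sketch of one gated graph-convolution layer over region features. This is not the authors' implementation: the function name `gated_gcn_layer`, the parameter layout, the direction keys ("in", "out", "self"), the reserved label 0 for self-loops, and the ReLU activation are all assumptions made for illustration. Each region aggregates messages from its neighbors, where the weight matrix depends on edge direction, the bias depends on the edge label, and a scalar sigmoid gate scales each message.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_gcn_layer(V, edges, W, b, W_gate, b_gate):
    """One gated graph-convolution step over region features (illustrative).

    V      : (K, D) array, one feature row per detected region.
    edges  : list of (src, dst, label) directed, labeled edges; label >= 1.
    W      : dict direction -> (D, D) weights for "in", "out", "self".
    b      : (num_labels, D) per-label biases; row 0 is reserved for self-loops.
    W_gate : dict direction -> (D,) gate weights.
    b_gate : (num_labels,) per-label gate biases.
    """
    K = V.shape[0]
    # Each message is (receiver, sender, direction, label).
    messages = [(i, i, "self", 0) for i in range(K)]
    for src, dst, label in edges:
        messages.append((dst, src, "in", label))   # along the edge
        messages.append((src, dst, "out", label))  # against the edge

    out = np.zeros_like(V, dtype=float)
    for receiver, sender, direction, label in messages:
        # Scalar gate in (0, 1) that scales this edge's contribution.
        gate = sigmoid(V[sender] @ W_gate[direction] + b_gate[label])
        out[receiver] += gate * (V[sender] @ W[direction] + b[label])
    return np.maximum(out, 0.0)  # ReLU (assumed activation)


# Toy usage: 3 regions, 4-dim features, 2 edge labels plus the self-loop label.
rng = np.random.default_rng(0)
K, D, num_labels = 3, 4, 3
V = rng.normal(size=(K, D))
edges = [(0, 1, 1), (2, 1, 2)]
W = {d: rng.normal(scale=0.1, size=(D, D)) for d in ("in", "out", "self")}
b = np.zeros((num_labels, D))
W_gate = {d: rng.normal(scale=0.1, size=D) for d in ("in", "out", "self")}
b_gate = np.zeros(num_labels)

refined = gated_gcn_layer(V, edges, W, b, W_gate, b_gate)
print(refined.shape)  # (3, 4)
```

In the described pipeline, a layer of this kind would be run separately on the semantic graph and on the spatial graph, and each set of refined region features would feed its own attention-based LSTM decoder.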

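Experimental Results

Research Questions

RQ1: Do semantic and spatial relationships between objects improve caption quality beyond region-level attention?
RQ2: Can a GCN over relation graphs produce more informative region representations for caption generation?
RQ3: How much does combining the semantic and spatial relation signals affect caption quality?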

Key Findings

Model	BLEU@1	BLEU@4	METEOR	ROUGE-L	CIDEr-D	SPICE
GCN-LSTM (Cross-Entropy)	77.4	37.1	28.1	57.2	117.1	21.1
GCN-LSTM (CIDEr-D Optimized)	80.9	38.3	28.6	58.5	128.7	22.1
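
The effect of CIDEr-D optimization can be read directly off the two rows of the table. The short sketch below only recomputes the per-metric differences from the reported numbers; the metric names expand the column abbreviations (B@1, B@4, M, R, C, S), and the variable names are illustrative.

```python
# Scores copied from the two table rows above (COCO, GCN-LSTM).
metrics = ["BLEU@1", "BLEU@4", "METEOR", "ROUGE-L", "CIDEr-D", "SPICE"]
cross_entropy = [77.4, 37.1, 28.1, 57.2, 117.1, 21.1]
cider_optimized = [80.9, 38.3, 28.6, 58.5, 128.7, 22.1]

# Absolute gain of the CIDEr-D optimized model over the cross-entropy model.
gains = {
    name: round(opt - xe, 1)
    for name, xe, opt in zip(metrics, cross_entropy, cider_optimized)
}

for name, gain in gains.items():
    print(f"{name}: +{gain}")
```

The largest absolute gain is on CIDEr-D itself (11.6 points), which is expected since that is the metric being optimized.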

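GCN-LSTM variants outperform the baselines (LSTM, Up-Down, SCST, ADP-ATT) on multiple COCO metrics.
With CIDEr-D optimization, GCN-LSTM reaches 128.7 CIDEr-D and 22.1 SPICE, up from the 120.1 CIDEr-D that the abstract cites as the prior result.
Using both the semantic and spatial graphs with late fusion yields a further gain over either single-graph variant.
On the COCO online test server, GCN-LSTM achieves the best performance on both the c5 and c40 reference sets.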
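
Late fusion of the two decoders amounts to a weighted average of their next-word distributions at each decoding step. The sketch below is a hedged illustration only: the function name `late_fusion`, the weight `alpha`, its value of 0.5, and the toy vocabulary and probabilities are assumptions, not values taken from the paper.

```python
import numpy as np


def late_fusion(p_semantic, p_spatial, alpha=0.5):
    """Linearly combine two next-word distributions (illustrative).

    p_semantic, p_spatial : probabilities over the same vocabulary, produced
                            by the semantic-graph and spatial-graph decoders.
    alpha                 : weight on the semantic decoder (assumed value).
    """
    p_semantic = np.asarray(p_semantic, dtype=float)
    p_spatial = np.asarray(p_spatial, dtype=float)
    if p_semantic.shape != p_spatial.shape:
        raise ValueError("both decoders must share one vocabulary")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # A convex combination of two distributions is itself a distribution.
    return alpha * p_semantic + (1.0 - alpha) * p_spatial


# Toy usage with a three-word vocabulary.
vocab = ["a", "dog", "cat"]
fused = late_fusion([0.2, 0.5, 0.3], [0.1, 0.3, 0.6], alpha=0.5)
best_word = vocab[int(np.argmax(fused))]
print(fused, best_word)
```

Because the weights sum to one, the fused vector still sums to one and can be used directly for greedy or beam-search decoding.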

This review was generated by AI and reviewed by a human editor.