QUICK REVIEW

[Paper Review] Image Captioning with Semantic Attention

Quanzeng You, Hailin Jin|arXiv (Cornell University)|Mar 12, 2016

Multimodal Machine Learning Applications | References: 34 | Cited by: 247

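One-Sentence Summary

The paper proposes a semantic attention model that fuses top-down CNN features with bottom-up detected visual concepts inside an RNN to generate image captions, reporting state-of-the-art results on MS-COCO and Flickr30K.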

ABSTRACT

Automatically generating a natural language description of an image has attracted interests recently both because of its importance in practical applications and because it connects two major artificial intelligence fields: computer vision and natural language processing. Existing approaches are either top-down, which start from a gist of an image and convert it into words, or bottom-up, which come up with words describing various aspects of an image and then combine them. In this paper, we propose a new algorithm that combines both approaches through a model of semantic attention. Our algorithm learns to selectively attend to semantic concept proposals and fuse them into hidden states and outputs of recurrent neural networks. The selection and fusion form a feedback connecting the top-down and bottom-up computation. We evaluate our algorithm on two public benchmarks: Microsoft COCO and Flickr30K. Experimental results show that our algorithm significantly outperforms the state-of-the-art approaches consistently across different evaluation metrics.

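Motivation and Goals

Bridge the gap between top-down and bottom-up captioning approaches by using semantic attention.
Develop a model that attends to semantically meaningful concepts during caption generation and fuses them with global image features.
Demonstrate improved caption quality on standard benchmarks, and analyze attention behavior and attribute prediction.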

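Proposed Method

Extract a global visual feature from a CNN, together with a set of visual attributes (A_i) detected in the image.
Generate the caption with an LSTM/RNN; an input attention mechanism (alpha_t^i) selects attributes conditioned on the previous word.
Introduce an output attention mechanism (beta_t^i) that conditions word prediction on the attended attributes and the current RNN state.
Compute attention scores through bilinear/embedding projections, producing weighted sums of attribute embeddings for the input and the output paths, which are then integrated with the recurrent state.
Train end to end with a negative log-likelihood objective plus regularization terms (g(alpha), g(beta)) that encourage attention over the attributes to be both complete and sparse.
Predict attributes with either a non-parametric method (k-NN over weakly annotated images) or parametric methods (a multi-label classifier trained with a ranking loss, and a fully convolutional network).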
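
The two attention steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the bilinear matrices U and V, the toy dimensions, and the plain additive fusion with the previous-word embedding are all assumptions made for the example; only the roles of alpha_t^i and beta_t^i come from the summary above.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()


def input_attention(prev_word_emb, attr_embs, U):
    """Input attention (alpha_t^i): weight attributes given the previous word.

    prev_word_emb: (d,) embedding of the previous word
    attr_embs:     (k, d) embeddings of the k detected attributes
    U:             (d, d) bilinear matrix (hypothetical parameter name)
    """
    # Bilinear score for attribute i: prev_word_emb^T U attr_embs[i]
    scores = attr_embs @ (U.T @ prev_word_emb)
    alpha = softmax(scores)
    # Attention-weighted sum of attribute embeddings
    context = alpha @ attr_embs
    # Assumed fusion: add the context to the previous-word embedding
    return alpha, prev_word_emb + context


def output_attention(hidden, attr_embs, V):
    """Output attention (beta_t^i): weight attributes given the RNN state.

    hidden: (n,) current RNN hidden state
    V:      (n, d) bilinear matrix (hypothetical parameter name)
    """
    # Bilinear score for attribute i: hidden^T V attr_embs[i]
    scores = attr_embs @ (V.T @ hidden)
    beta = softmax(scores)
    # The context vector would feed the word-prediction layer
    return beta, beta @ attr_embs


# Toy usage with random values: 5 attributes, 8-dim embeddings, 16-dim state
rng = np.random.default_rng(0)
attrs = rng.normal(size=(5, 8))
alpha, x_t = input_attention(rng.normal(size=8), attrs, rng.normal(size=(8, 8)))
beta, ctx = output_attention(rng.normal(size=16), attrs, rng.normal(size=(16, 8)))
```

Both weight vectors are softmax-normalized, so each sums to 1 over the detected attributes; the regularization terms g(alpha) and g(beta) mentioned above are not modeled in this sketch.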

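Experimental Results

Research Questions

RQ1: Can semantic attention over detected visual concepts improve image captioning beyond purely top-down or purely bottom-up approaches?
RQ2: How should input and output attention over attributes be designed to best influence RNN state updates and word prediction?
RQ3: How do different attribute prediction strategies (k-NN, ranking loss, FCN) affect caption quality?
RQ4: Does combining global features with semantically attended attributes yield better results on standard metrics (BLEU, METEOR, ROUGE-L, CIDEr)?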

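Key Findings

The semantic attention model significantly outperforms prior state-of-the-art methods on MS-COCO and Flickr30K across multiple metrics.
Attending to visual attributes through the input and output mechanisms each improves caption quality markedly, and combining the two works best.
FCN-based attribute prediction is more robust than the ranking-loss or k-NN methods, and this carries over to captioning performance.
Among the fusion strategies compared, using the top 3 attended attributes generally performs best, ahead of simple max selection or concatenation.
Ground-truth attributes provide an upper bound and show substantial potential gains, indicating that attribute quality strongly affects caption quality.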


This review was generated by AI and checked by human editors.