QUICK REVIEW

[Paper Review] Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects

Ting Yao, Yingwei Pan|arXiv (Cornell University)|Aug 17, 2017

Multimodal Machine Learning Applications | References: 28 | Cited by: 33

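One-Sentence Summary

This paper proposes LSTM-C, an image captioning framework that adds a copying mechanism to the CNN-RNN architecture so that it can describe novel objects that never appear in the paired training data. Using object classifiers trained on external recognition datasets, LSTM-C can copy object names directly into the generated caption; it reports state-of-the-art results on MSCOCO and ImageNet, including a 17.8% relative improvement in accuracy when describing novel objects.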

ABSTRACT

Image captioning often requires a large set of training image-sentence pairs. In practice, however, acquiring sufficient training pairs is always expensive, making the recent captioning models limited in their ability to describe objects outside of training corpora (i.e., novel objects). In this paper, we present Long Short-Term Memory with Copying Mechanism (LSTM-C) --- a new architecture that incorporates copying into the Convolutional Neural Networks (CNN) plus Recurrent Neural Networks (RNN) image captioning framework, for describing novel objects in captions. Specifically, freely available object recognition datasets are leveraged to develop classifiers for novel objects. Our LSTM-C then nicely integrates the standard word-by-word sentence generation by a decoder RNN with copying mechanism which may instead select words from novel objects at proper places in the output sentence. Extensive experiments are conducted on both MSCOCO image captioning and ImageNet datasets, demonstrating the ability of our proposed LSTM-C architecture to describe novel objects. Furthermore, superior results are reported when compared to state-of-the-art deep models.

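Motivation and Objectives

Address the inability of existing image captioning models to describe objects that are absent from the training data (i.e., novel objects).
Bring knowledge from freely available object recognition datasets into the captioning pipeline to improve generalization to unseen objects.
Develop an end-to-end trainable framework that combines LSTM sequence generation with a mechanism for copying object names.
Show that the copying mechanism substantially improves novel object captioning, especially when combined with external text data.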

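Proposed Method

A CNN extracts visual features from the input image, and these features are fed to an LSTM decoder that generates the sentence.
Object classifiers pretrained on external datasets (e.g., ImageNet) produce a list of candidate objects for the image.
A copying layer is added on top of the LSTM decoder so that the model can copy the names of detected objects directly into the output sentence.
The copying mechanism is integrated through a soft attention mechanism that computes a probability distribution over both the vocabulary and the detected objects, balanced by a trade-off parameter λ.
The model is trained end to end with a cross-entropy loss; the copy path is differentiable, so words for detected objects can be routed to the output.
Word embeddings are pretrained on external unpaired text (e.g., BNC and Wikipedia) to improve generalization and performance.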
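
The blending of generation and copying described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name, the dictionary inputs, the toy scores, and the choice to weight generation by λ and copying by 1 - λ are all assumptions made for the example. It only shows how two sets of unnormalized scores could share a single normalization, so that a detected object name competes directly with ordinary vocabulary words.

```python
import numpy as np


def mix_generate_and_copy(gen_scores, copy_scores, lam=0.2):
    """Blend generation and copy scores into one distribution (illustrative sketch).

    gen_scores:  dict word -> unnormalized decoder score (generation mode)
    copy_scores: dict word -> unnormalized score for a detected object (copy mode)
    lam:         trade-off weight; ASSUMPTION: lam weights generation, 1 - lam copying
    """
    vocab = sorted(set(gen_scores) | set(copy_scores))
    # Subtract the largest score before exponentiating for numerical stability.
    shift = max(list(gen_scores.values()) + list(copy_scores.values()))
    weights = np.zeros(len(vocab))
    for i, word in enumerate(vocab):
        if word in gen_scores:
            weights[i] += lam * np.exp(gen_scores[word] - shift)
        if word in copy_scores:
            weights[i] += (1.0 - lam) * np.exp(copy_scores[word] - shift)
    # Shared normalization over the union of both vocabularies.
    probs = weights / weights.sum()
    return dict(zip(vocab, probs))


# Toy scores (made up): "bus" is only reachable through the copy path.
gen = {"a": 2.0, "red": 1.0, "truck": 0.5}
copy = {"bus": 1.5, "truck": 0.2}
probs = mix_generate_and_copy(gen, copy, lam=0.2)
print(max(probs, key=probs.get))
```

With lam=1.0 in this sketch the copy path is switched off entirely, so a word such as "bus" that exists only among the detected objects gets zero probability. That is the failure mode the copying layer is meant to avoid.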

Figure 1: An example of object recognition and image captioning. The input is an image, while the output is the detected objects and a natural sentence, respectively. (upper row: the detected objects in the image; middle row: the sentence generated by LRCN [ 4 ] image captioning approach; bottom row

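Experimental Results

Research Questions

RQ1: Can a copying mechanism improve a captioning model's ability to describe novel objects that do not appear in the training corpus?
RQ2: How does integrating external object recognition models improve the captioning model's generalization to unseen objects?
RQ3: What is the best trade-off between generating words from the vocabulary and copying words from detected objects during caption generation?
RQ4: Does external unpaired text data further improve novel object captioning?
RQ5: How robust is the copying mechanism across object categories, especially for objects that are visually similar to common ones?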

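Key Findings

On the MSCOCO dataset, LSTM-C achieves a novel-object accuracy of 72.08% and an F1 score of 16.39%, outperforming the NOC baseline by 1.4% and 0.76%, respectively.
On the ImageNet dataset, LSTM-C achieves a 17.8% relative improvement in accuracy over the NOC baseline, indicating strong generalization to novel objects at scale.
The model achieves the highest F1 score on six of the eight novel objects and performs best at λ ≈ 0.2, the setting that best balances generation and copying.
Adding external text data (BNC and Wikipedia) improves performance further; with one-hot + GloVe embeddings, accuracy on ImageNet reaches 31.11%.
Qualitative results show that LSTM-C can copy the precise object name (e.g., "bus" rather than "hydrant") into the caption, improving semantic accuracy.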
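
To make the metrics in the findings above concrete, here is a small Python sketch of how a per-object F1 score and a relative improvement could be computed. The summary does not define its F1 score. The definition below (a caption counts as a positive prediction when it mentions the object word, and an image counts as a true positive when it actually contains the object) is an assumption based on common practice in novel-object captioning. The helper names, captions, and numbers are toy values invented for illustration, not results from the paper.

```python
def novel_object_f1(captions, contains_object, word):
    """F1 for mentioning `word` (ASSUMED definition, see the lead-in).

    captions:        list of generated caption strings
    contains_object: list of bools, True if the image really shows the object
    word:            the novel object name to look for
    """
    tp = fp = fn = 0
    for caption, has_object in zip(captions, contains_object):
        mentioned = word in caption.lower().split()
        if mentioned and has_object:
            tp += 1
        elif mentioned and not has_object:
            fp += 1
        elif has_object:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, e.g. 0.178 for a 17.8% relative gain."""
    return (new - baseline) / baseline


# Toy data (made up): one hit, one miss, one false alarm, one true negative.
captions = [
    "a bus parked on the street",
    "a red truck on the road",
    "a bus driving down a road",
    "a man riding a horse",
]
contains_bus = [True, True, False, False]
print(novel_object_f1(captions, contains_bus, "bus"))
print(relative_improvement(0.12, 0.10))
```

A relative improvement divides the gain by the baseline value, so it is not the same as a difference in percentage points. This matters when reading the 17.8% figure next to the absolute scores.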

Figure 2: The overview of Long Short-Term Memory with Copying Mechanism (LSTM-C) for describing novel objects (better viewed in color). (a) $\mathcal{W}_{g}$ and $\mathcal{W}_{c}$ are the vocabularies on paired image-sentence dataset and unpaired object recognition dataset, respectively. (b) The ima

This review was generated by AI and checked by a human editor.