QUICK REVIEW

[Paper Review] Image Captioning at Will: A Versatile Scheme for Effectively Injecting Sentiments into Image Descriptions

Quanzeng You, Hailin Jin|arXiv (Cornell University)|Jan 30, 2018

Multimodal Machine Learning Applications · References: 31 · Cited by: 47

One-Sentence Summary

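This paper proposes two end-to-end models that inject sentiment into image captions. They produce controllable positive or negative captions without sacrificing visual-semantic alignment and outperform existing methods on the sentimental captioning task.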

ABSTRACT

Automatic image captioning has recently approached human-level performance due to the latest advances in computer vision and natural language understanding. However, most of the current models can only generate plain factual descriptions about the content of a given image. For human beings, by contrast, image caption writing is quite flexible and diverse, where additional language dimensions, such as emotion, humor and language styles, are often incorporated to produce diverse, emotional, or appealing captions. In particular, we are interested in generating sentiment-conveying image descriptions, which has received little attention. The main challenge is how to effectively inject sentiments into the generated captions without altering the semantic matching between the visual content and the generated descriptions. In this work, we propose two different models, which employ different schemes for injecting sentiments into image captions. Compared with the few existing approaches, the proposed models are much simpler and yet more effective. The experimental results show that our models outperform the state-of-the-art models in generating sentimental (i.e., sentiment-bearing) image captions. In addition, we can also easily manipulate the model by assigning different sentiments to the testing image to generate captions with the corresponding sentiments.

Motivation and Goals

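Motivation: image captions should go beyond factual description and convey sentiment.
Propose end-to-end models that inject sentiment into caption generation without degrading image-text alignment.
Enable controllable sentiment generation by explicitly conditioning captions on a sentiment label.
Show that the sentiment-aware models outperform existing state-of-the-art baselines on the sentimental captioning task.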

Proposed Method

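Direct Injection: concatenate a sentiment unit (-1, 0, or 1) to the RNN input at every generation step to bias word selection.
Sentiment Flow: introduce a sentiment unit that propagates the initial sentiment signal through the LSTM, and use a sentiment loss to guide the final sentiment state.
Train end-to-end on MS-COCO plus SentiCap data, with sentiment labels and an optional sentiment loss.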
Use ResNet-152 as CNN encoder and a 256-d embedding with a 512-d RNN, trained with Adam optimizer.
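
The two injection schemes above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' code: the sizes are toy values, image features are omitted, Sentiment Flow is approximated by feeding the sentiment only at step 0 and letting the LSTM state carry it forward, and the sentiment loss is modeled as a squared error between a hypothetical readout `v` of the final hidden state and the label. All helper names (`init_params`, `run`, `loss`) and the weight `lam` are illustrative.

```python
import numpy as np

# Toy sizes (assumption); the summary reports a 256-d embedding and a 512-d RNN.
VOCAB, EMB, HID = 20, 8, 16


def init_params(rng):
    """Hypothetical parameters: LSTM weights W/b, vocab projection U, sentiment readout v."""
    W = rng.standard_normal((4 * HID, EMB + 1 + HID)) * 0.1  # +1 input unit for sentiment
    b = np.zeros(4 * HID)
    U = rng.standard_normal((VOCAB, HID)) * 0.1
    v = rng.standard_normal(HID) * 0.1
    return W, b, U, v


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def lstm_step(x, h, c, W, b):
    """One standard LSTM step; W maps [x; h] to input/forget/output/candidate gates."""
    i, f, o, g = np.split(W @ np.concatenate([x, h]) + b, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c


def run(word_embs, sentiment, mode, W, b):
    """Return the hidden state after each input word.

    mode="direct": sentiment in {-1, 0, 1} is appended to the input at every step.
    mode="flow":   sentiment is appended only at step 0 (0 afterwards), so the
                   recurrent state has to carry it to later steps (an approximation).
    """
    h, c = np.zeros(HID), np.zeros(HID)
    states = []
    for t, e in enumerate(word_embs):
        s = sentiment if (mode == "direct" or t == 0) else 0.0
        h, c = lstm_step(np.append(e, s), h, c, W, b)
        states.append(h)
    return states


def loss(word_ids, E, sentiment, mode, params, lam=0.5):
    """Mean next-word cross-entropy plus lam * squared sentiment error (lam=0 disables it)."""
    W, b, U, v = params
    states = run(E[word_ids[:-1]], sentiment, mode, W, b)
    xent = 0.0
    for h, target in zip(states, word_ids[1:]):
        logits = U @ h
        log_probs = logits - logits.max()              # stabilized log-softmax
        log_probs -= np.log(np.exp(log_probs).sum())
        xent -= log_probs[target]
    sent_pred = np.tanh(v @ states[-1])  # hypothetical readout of the final sentiment state
    return xent / len(states) + lam * (sent_pred - sentiment) ** 2


rng = np.random.default_rng(0)
params = init_params(rng)
E = rng.standard_normal((VOCAB, EMB))  # hypothetical word-embedding table
caption = [1, 5, 7, 2]                  # hypothetical word ids
for mode in ("direct", "flow"):
    print(mode, loss(caption, E, 1.0, mode, params))
```

Setting `lam=0.0` drops the sentiment term, which corresponds to training without the optional sentiment loss. With `sentiment=0` both modes feed identical inputs, so they only diverge for positive or negative labels.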

Experimental Results

Research Questions

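RQ1: Can sentiment be injected into image captions while preserving their semantic correspondence with the image?
RQ2: Which architectural scheme, Direct Injection or Sentiment Flow, better supports controllable sentimental captioning?
RQ3: Does adding a sentiment loss improve the model's ability to distinguish and propagate sentiment across the whole caption sequence?
RQ4: On positive and negative examples, how well do the models produce captions that match the given sentiment label?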

Key Findings

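Both proposed models outperform the cited baselines on the sentimental captioning benchmarks under standard metrics.
Direct Injection produces a stronger sentiment signal at each step and a higher proportion of sentimental captions, especially negative ones.
Sentiment Flow delivers balanced performance on the POS and NEG sets and benefits from the sentiment loss in several configurations.
Both models support controllable generation: flipping the sentiment label at test time yields captions whose sentiment words match the label and are distributed across the described image content.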
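
To illustrate test-time label flipping, here is a toy greedy decoder. Everything in it is hypothetical: the per-step candidate scores stand in for a trained decoder's image-conditioned logits, the word polarities are made up, and the sentiment acts as an additive bias `alpha * sentiment * polarity`. This is a simplification of how an injected sentiment unit could shift word choice, not the paper's decoder.

```python
# Hypothetical image-conditioned word scores for each decoding step; in a real
# model these would be logits computed from CNN features and the RNN state.
STEP_SCORES = [
    {"the": 2.0, "a": 1.5},
    {"brown": 1.0, "cute": 0.6, "ugly": 0.5},
    {"dog": 2.0, "cat": 0.3},
    {"on": 1.2, "in": 1.0},
    {"the": 2.0},
    {"grass": 2.0, "sand": 0.4},
]

# Hypothetical word polarities: +1 positive, -1 negative, absent means neutral (0).
POLARITY = {"cute": 1, "ugly": -1}


def greedy_caption(sentiment, alpha=1.0):
    """Greedily pick, at each step, the word maximizing score + alpha * sentiment * polarity."""
    words = []
    for scores in STEP_SCORES:
        best = max(scores, key=lambda w: scores[w] + alpha * sentiment * POLARITY.get(w, 0))
        words.append(best)
    return " ".join(words)


for s in (1, 0, -1):  # positive, neutral, negative label for the same "image"
    print(s, greedy_caption(s))
# 1 the cute dog on the grass
# 0 the brown dog on the grass
# -1 the ugly dog on the grass
```

Neutral content words such as "dog" and "grass" are unaffected by the label because their polarity is 0, which mirrors the goal of changing sentiment without changing the visual semantics. A smaller `alpha` (e.g. 0.3) is too weak to override the visual scores here, so the neutral caption is kept.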

This review was generated by AI and reviewed by human editors.