QUICK REVIEW

[Paper Review] Deep Reinforcement Learning-based Image Captioning with Embedding Reward

Zhou Ren, Xiaoyu Wang|arXiv (Cornell University)|Apr 12, 2017

Multimodal Machine Learning Applications|References 44|Cited by 64

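ONE-SENTENCE SUMMARY

This paper proposes a policy network and a value network, trained with actor-critic reinforcement learning and rewarded by visual-semantic embedding similarity, that achieve state-of-the-art image captioning on MS COCO; it also introduces a lookahead inference that combines local and global guidance at decoding time.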

ABSTRACT

Image captioning is a challenging problem owing to the complexity in understanding the image content and diverse ways of describing it in natural language. Recent advances in deep neural networks have substantially improved the performance of this task. Most state-of-the-art approaches follow an encoder-decoder framework, which generates captions using a sequential recurrent prediction model. However, in this paper, we introduce a novel decision-making framework for image captioning. We utilize a "policy network" and a "value network" to collaboratively generate captions. The policy network serves as a local guidance by providing the confidence of predicting the next word according to the current state. Additionally, the value network serves as a global and lookahead guidance by evaluating all possible extensions of the current state. In essence, it adjusts the goal of predicting the correct words towards the goal of generating captions similar to the ground truth captions. We train both networks using an actor-critic reinforcement learning model, with a novel reward defined by visual-semantic embedding. Extensive experiments and analyses on the Microsoft COCO dataset show that the proposed framework outperforms state-of-the-art approaches across different evaluation metrics.

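MOTIVATION AND OBJECTIVES

Model image captioning as a decision-making process with both local and global guidance.
Develop a policy network and a value network that generate captions collaboratively.
Define a reward based on visual-semantic embedding for use in reinforcement learning.
Train with an actor-critic framework so that generated captions better match the image across evaluation metrics.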

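PROPOSED METHOD

Model image captioning as a sequential decision process whose state consists of the image and the words generated so far.
Use a policy network (CNN + RNN) to predict the next word, and a value network (CNN + RNN + MLP) to estimate the future reward.
Define the reward as the similarity between the embedding of the generated caption and the embedding of the image in a visual-semantic embedding space.
Pretrain the policy network with a cross-entropy loss and the value network with a mean squared error loss, then train the two jointly with actor-critic reinforcement learning.
Introduce a lookahead inference that combines policy (local) and value (global) guidance during decoding, with a tunable λ balancing the two.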
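
The reward and the lookahead decoding rule described above can be illustrated with a minimal Python sketch. It rests on several assumptions not stated in this review: the reward is taken to be the cosine similarity of two embedding vectors, and each beam score is assumed to be updated as S_new = S_old + λ · log p(word | state) + (1 − λ) · value(new state). The names `embedding_reward`, `lookahead_step`, `toy_policy`, and `toy_value` are hypothetical, and the toy functions stand in for the trained networks; this is not the authors' implementation.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Caption = Tuple[str, ...]


def embedding_reward(image_vec: Vector, caption_vec: Vector) -> float:
    """Cosine similarity between an image embedding and a caption embedding.

    Assumption: both vectors already live in the same embedding space.
    """
    dot = sum(a * b for a, b in zip(image_vec, caption_vec))
    norm = math.sqrt(sum(a * a for a in image_vec)) * math.sqrt(
        sum(b * b for b in caption_vec)
    )
    return dot / norm if norm > 0 else 0.0


def lookahead_step(
    beams: List[Tuple[Caption, float]],
    policy: Callable[[Caption], Dict[str, float]],
    value: Callable[[Caption], float],
    lam: float = 0.4,
    beam_size: int = 2,
) -> List[Tuple[Caption, float]]:
    """Extend every beam by one word and keep the best `beam_size` candidates.

    Assumed score update:
        S_new = S_old + lam * log p(word | state) + (1 - lam) * value(new state)
    lam = 1 uses only the policy (local) term; lam = 0 uses only the value
    (global) term.
    """
    candidates = []
    for words, score in beams:
        for word, prob in policy(words).items():
            if prob <= 0.0:
                continue  # log(0) is undefined; skip impossible words
            extended = words + (word,)
            new_score = (
                score + lam * math.log(prob) + (1.0 - lam) * value(extended)
            )
            candidates.append((extended, new_score))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]


# Toy stand-ins for the trained networks (hypothetical, for illustration only).
def toy_policy(words: Caption) -> Dict[str, float]:
    # The policy locally prefers "dog".
    return {"dog": 0.6, "cat": 0.4}


def toy_value(words: Caption) -> float:
    # The value function expects a higher final reward after "cat".
    return 0.9 if words[-1] == "cat" else 0.1


# With lam = 0.4 the value term outweighs the policy preference.
mixed = lookahead_step([((), 0.0)], toy_policy, toy_value, lam=0.4, beam_size=1)
# With lam = 1.0 the step reduces to ordinary policy-only beam search.
policy_only = lookahead_step([((), 0.0)], toy_policy, toy_value, lam=1.0, beam_size=1)
```

In this toy setup, `mixed` keeps the caption beginning with "cat" while `policy_only` keeps "dog", which shows how λ trades the policy's local confidence against the value network's global estimate.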

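EXPERIMENTAL RESULTS

Research Questions

RQ1: Does the embedding reward improve caption quality across multiple metrics compared with standard supervised learning?
RQ2: How does lookahead inference affect decoding when both the policy and value networks are used?
RQ3: What is the effect of reinforcement learning in the proposed framework relative to the baseline methods?
RQ4: Why is the value network designed with separate visual and semantic streams rather than relying only on the policy network's hidden state?
RQ5: How sensitive are the results to hyperparameters such as λ and the beam size?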

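Key Findings

The paper reports state-of-the-art performance on MS COCO across BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, ROUGE-L, and CIDEr.
Embedding-driven actor-critic learning improves generalization across metrics without using additional external data.
Lookahead inference, which combines policy and value guidance, outperforms standard beam search and the baselines, improving caption quality.
The full model outperforms the baselines on most metrics, reaching a CIDEr of 0.937.
Hyperparameter analysis shows the best results at λ of about 0.4 with a moderate beam size.
Variants that use only the value network or only the policy network fall short of the full model, underscoring the need to combine the two.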


This review was generated by AI and reviewed by human editors.