QUICK REVIEW

[Paper Review] Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning

Abhishek Das, Satwik Kottur | arXiv (Cornell University) | Mar 20, 2017

Multimodal Machine Learning Applications | References: 31 | Cited by: 91

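One-Sentence Summary

This paper introduces goal-driven training for visual question answering and dialog agents through a cooperative image-guessing game between Q-bot and A-bot, with both policies learned end-to-end by deep reinforcement learning.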

ABSTRACT

We introduce the first goal-driven training for visual question answering and dialog agents. Specifically, we pose a cooperative 'image guessing' game between two agents -- Qbot and Abot -- who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images. We use deep reinforcement learning (RL) to learn the policies of these agents end-to-end -- from pixels to multi-agent multi-round dialog to game reward. We demonstrate two experimental results. First, as a 'sanity check' demonstration of pure RL (from scratch), we show results on a synthetic world, where the agents communicate in ungrounded vocabulary, i.e., symbols with no pre-specified meanings (X, Y, Z). We find that two bots invent their own communication protocol and start using certain symbols to ask/answer about certain visual attributes (shape/color/style). Thus, we demonstrate the emergence of grounded language and communication among 'visual' dialog agents with no human supervision. Second, we conduct large-scale real-image experiments on the VisDial dataset, where we pretrain with supervised dialog data and show that the RL 'fine-tuned' agents significantly outperform SL agents. Interestingly, the RL Qbot learns to ask questions that Abot is good at, ultimately resulting in more informative dialog and a better team.

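Research Motivation and Goals

Motivate the development of visually grounded conversational AI that can understand and discuss images.
Propose a cooperative two-agent setting in which one agent asks questions and the other answers them, so that the questioner can identify an unseen image.
Show that end-to-end deep RL can ground language and improve dialog quality relative to supervised baselines.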

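Proposed Method

Formalize a cooperative image-guessing game between Q-bot (the questioner) and A-bot (the answerer).
Represent the dialog as discrete natural-language tokens, and ground Q-bot's prediction in an image embedding through a feature regression network.
Train both agents and the grounded prediction end-to-end with deep RL (REINFORCE), from pixels through multi-round dialog to game reward.
Give Q-bot and A-bot two-level hierarchical encoder-decoder policies that share a token vocabulary.
Move from purely supervised learning to goal-driven optimization by rewarding improvement in the predicted image representation.
Pretrain on supervised VisDial data, then fine-tune with RL to improve performance.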
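The reward and update described above can be illustrated with a minimal sketch. It assumes the per-round reward is the decrease in Euclidean distance between the predicted and true image embeddings, and it applies a plain REINFORCE update to the logits of a toy categorical token policy. The function names, the learning rate, and the tiny vectors are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
import math


def distance(pred, target):
    """Euclidean distance between two embedding vectors (assumed metric)."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)))


def round_reward(prev_pred, new_pred, target):
    """Reward for one dialog round: how much closer the new image-embedding
    prediction is to the target than the previous prediction was."""
    return distance(prev_pred, target) - distance(new_pred, target)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.1):
    """One REINFORCE update for a categorical policy over tokens.

    The gradient of log pi(action) with respect to the logits is
    onehot(action) - softmax(logits); it is scaled by the reward.
    """
    probs = softmax(logits)
    return [
        x + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (x, p) in enumerate(zip(logits, probs))
    ]


# Hypothetical example: the prediction moves halfway toward the target.
target = [1.0, 0.0]
reward = round_reward([0.0, 0.0], [0.5, 0.0], target)
updated = reinforce_step([0.0, 0.0, 0.0], action=1, reward=reward)
```

With a positive reward, the logit of the sampled token rises relative to the others; with a negative reward it falls, which is the mechanism that lets game reward shape which tokens the agents emit.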

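Experimental Results

Research Questions

RQ1: Can two cooperating dialog agents learn visually grounded communication without human supervision?
RQ2: Does reinforcement learning after supervised pretraining yield better image-guessing performance than purely supervised dialog training?
RQ3: How should the agents structure their questions and answers to maximize information gain about the unseen image?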

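Key Findings

In a synthetic setting with an ungrounded vocabulary, the agents invent their own communication protocol that associates symbols with visual attributes.
On real images (VisDial), the RL fine-tuned agents outperform the supervised baseline on the image-guessing task.
The RL-trained Q-bot learns a questioning strategy aligned with A-bot's strengths, yielding more informative dialog and better team performance.
Even with imperfect perception, grounded language emerges end-to-end through interaction.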
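One way to score an image-guessing task like the one above is to sort a pool of candidate images by distance to the questioner's predicted embedding and check where the true image lands. The sketch below assumes Euclidean distance and a simple percentile definition (the share of other pool images that lie farther away than the target). Both are illustrative assumptions, and the function names are hypothetical, not taken from the paper.

```python
import math


def distance(a, b):
    """Euclidean distance between two embedding vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_of_target(pred, pool, target_index):
    """1-based rank of the target image when the pool is sorted by
    distance to the predicted embedding (rank 1 = closest)."""
    dists = [distance(pred, img) for img in pool]
    target_dist = dists[target_index]
    return 1 + sum(1 for d in dists if d < target_dist)


def percentile_rank(pred, pool, target_index):
    """Percentage of the other pool images that are farther from the
    prediction than the target is (hypothetical definition)."""
    dists = [distance(pred, img) for img in pool]
    target_dist = dists[target_index]
    farther = sum(1 for d in dists if d > target_dist)
    return 100.0 * farther / (len(pool) - 1)


# Hypothetical pool of four 2-D image embeddings and one prediction.
pool = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [5.0, 0.0]]
pred = [0.9, 0.0]
best_rank = rank_of_target(pred, pool, target_index=1)
best_pct = percentile_rank(pred, pool, target_index=1)
```

A higher percentile means the dialog has moved the prediction closer to the true image than to most distractors, so the same measure can compare supervised and RL fine-tuned agents.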

This review was generated by AI and reviewed by human editors.