QUICK REVIEW

[Paper Review] Gated-Attention Architectures for Task-Oriented Language Grounding

Devendra Singh Chaplot, Kanthashree Mysore Sathyendra|arXiv (Cornell University)|Jun 22, 2017

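Multimodal Machine Learning Applications | Cited by 99

One-Sentence Summary

Introduces an end-to-end Gated-Attention (GA) multimodal fusion for grounding natural language in 3D environments, with policies learned through reinforcement learning and imitation learning; the GA unit outperforms concatenation on multitask and zero-shot generalization.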

ABSTRACT

To perform tasks specified by natural language instructions, autonomous agents need to extract semantically meaningful representations of language and map it to visual elements and actions in the environment. This problem is called task-oriented language grounding. We propose an end-to-end trainable neural architecture for task-oriented language grounding in 3D environments which assumes no prior linguistic or perceptual knowledge and requires only raw pixels from the environment and the natural language instruction as input. The proposed model combines the image and text representations using a Gated-Attention mechanism and learns a policy to execute the natural language instruction using standard reinforcement and imitation learning methods. We show the effectiveness of the proposed model on unseen instructions as well as unseen maps, both quantitatively and qualitatively. We also introduce a novel environment based on a 3D game engine to simulate the challenges of task-oriented language grounding over a rich set of instructions and environment states.

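Motivation and Objectives

Develop an end-to-end architecture for task-oriented language grounding that takes raw pixels and natural language instructions as input.
Propose a novel Gated-Attention fusion mechanism for combining visual and language representations.
Train policies with reinforcement learning and imitation learning to execute instructions in 3D environments.
Demonstrate generalization to unseen instructions and unseen maps in a Doom-style environment built on ViZDoom.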

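Proposed Method

Process the image with a CNN to obtain x_I, and the instruction with a GRU to obtain x_L.
Fuse the two modalities with a novel Gated-Attention unit M_GA(x_I, x_L), which gates the convolutional feature maps with a sigmoid attention vector derived from x_L.
Compare GA fusion against a baseline concatenation fusion, M_concat(x_I, x_L).
Train the policy with A3C using entropy regularization and Generalized Advantage Estimation (reinforcement learning), or with behavioral cloning or DAgger (imitation learning).
Evaluate multitask and zero-shot generalization in a first-person, Doom-based ViZDoom environment with a set of 70 instructions.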
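
The two fusion operators compared above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the attention vector comes from a single linear layer followed by a sigmoid, with one gate per convolutional feature map, and the names `attention_vector`, `gated_attention`, `concat_fusion`, `weights`, and `bias` are hypothetical. Feature maps are represented as nested lists of shape [d][H][W].

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def attention_vector(x_l, weights, bias):
    """Map the instruction embedding x_L to one gate in (0, 1) per feature map.

    Assumption: a single linear layer plus sigmoid; `weights` has one row
    per feature map and one column per entry of x_L.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(row, x_l)) + b)
        for row, b in zip(weights, bias)
    ]


def gated_attention(x_i, x_l, weights, bias):
    """M_GA(x_I, x_L): scale every spatial cell of feature map k by gate a_k."""
    gates = attention_vector(x_l, weights, bias)
    return [
        [[gate * value for value in row] for row in fmap]
        for fmap, gate in zip(x_i, gates)
    ]


def concat_fusion(x_i, x_l):
    """M_concat(x_I, x_L): flatten the image features and append x_L."""
    flat = [value for fmap in x_i for row in fmap for value in row]
    return flat + list(x_l)


# Toy example: d = 3 feature maps of size 2 x 2, instruction embedding of size 4.
x_i = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(3)]
x_l = [0.5, -1.0, 0.25, 2.0]
weights = [
    [0.0, 0.0, 0.0, 0.0],    # gate 0: sigmoid(0) = 0.5
    [10.0, 0.0, 0.0, 0.0],   # gate 1: sigmoid(5), close to 1 (map kept)
    [-10.0, 0.0, 0.0, 0.0],  # gate 2: sigmoid(-5), close to 0 (map suppressed)
]
bias = [0.0, 0.0, 0.0]

gates = attention_vector(x_l, weights, bias)
fused = gated_attention(x_i, x_l, weights, bias)
baseline = concat_fusion(x_i, x_l)
```

In this toy setup the fused output keeps the [d][H][W] layout of the image features, while the concatenation baseline returns a flat vector of 3 * 2 * 2 + 4 = 16 values. The multiplicative gate lets the instruction switch whole feature maps on or off, which is the behavior the attention visualizations in the findings point to.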

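Experimental Results

Research Questions

RQ1: Does Gated-Attention multimodal fusion improve the grounding of natural language to visual elements in 3D environments?
RQ2: Compared with concatenation, does GA fusion generalize better to unseen instructions and unseen maps?
RQ3: How do reinforcement learning and imitation learning differ in performance across task settings when GA fusion is used?
RQ4: What do the attention maps reveal about attribute and object grounding under different instructions?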

Key Findings

Model	Params	Easy MT	Easy ZSL	Medium MT	Medium ZSL	Hard MT	Hard ZSL
BC Concat	5.21M	0.86	0.71	0.23	0.15	0.20	0.15
BC GA	5.09M	0.97	0.81	0.30	0.23	0.36	0.29
DAgger Concat	5.21M	0.92	0.73	0.45	0.23	0.19	0.13
DAgger GA	5.09M	0.94	0.85	0.55	0.40	0.29	0.30
A3C Concat	3.44M	1.00	0.80	0.80	0.54	0.24	0.12
A3C GA	3.39M	1.00	0.81	0.89	0.75	0.83	0.73

MT = multitask generalization, ZSL = zero-shot generalization; values are accuracy.

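The GA unit matches or outperforms the concatenation unit on multitask and zero-shot generalization in every difficulty mode.
In hard mode, GA with A3C reaches 83% MT and 73% ZSL, compared with 24% MT and 12% ZSL for Concat.
Under imitation learning (BC and DAgger), the GA models also outperform Concat, although the need for exploration hurts imitation learning in the harder modes.
Attention visualizations show dimension-specific gating that corresponds to attributes such as color and object type, indicating that instruction attributes are grounded successfully.
In the reported settings, the A3C GA model learns faster and converges to a higher accuracy than A3C Concat.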
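
The size of the GA advantage can be read directly off the results table. The snippet below is a small sketch that copies the table values and subtracts each Concat row from the matching GA row. It assumes the six numeric columns are MT and ZSL for the easy, medium, and hard modes in that order, which is consistent with the hard-mode figures quoted above; `RESULTS`, `COLUMNS`, and `ga_gains` are names made up for this illustration.

```python
# Accuracies copied from the results table.
# Assumed column order: Easy MT, Easy ZSL, Medium MT, Medium ZSL, Hard MT, Hard ZSL.
COLUMNS = ("Easy MT", "Easy ZSL", "Medium MT", "Medium ZSL", "Hard MT", "Hard ZSL")

RESULTS = {
    ("BC", "Concat"): (0.86, 0.71, 0.23, 0.15, 0.20, 0.15),
    ("BC", "GA"): (0.97, 0.81, 0.30, 0.23, 0.36, 0.29),
    ("DAgger", "Concat"): (0.92, 0.73, 0.45, 0.23, 0.19, 0.13),
    ("DAgger", "GA"): (0.94, 0.85, 0.55, 0.40, 0.29, 0.30),
    ("A3C", "Concat"): (1.00, 0.80, 0.80, 0.54, 0.24, 0.12),
    ("A3C", "GA"): (1.00, 0.81, 0.89, 0.75, 0.83, 0.73),
}


def ga_gains(results):
    """Return GA minus Concat accuracy per training method and column."""
    gains = {}
    for (method, fusion), scores in results.items():
        if fusion != "GA":
            continue
        baseline = results[(method, "Concat")]
        gains[method] = tuple(round(g - c, 2) for g, c in zip(scores, baseline))
    return gains


gains = ga_gains(RESULTS)
for method, deltas in gains.items():
    print(method, dict(zip(COLUMNS, deltas)))
```

Under that column assumption, every difference is zero or positive, and the largest gaps appear for A3C in hard mode (+0.59 MT and +0.61 ZSL).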

This review was generated by AI and reviewed by a human editor.