QUICK REVIEW

[Paper Review] Gated-Attention Architectures for Task-Oriented Language Grounding

Devendra Singh Chaplot, Kanthashree Mysore Sathyendra|arXiv (Cornell University)|Jun 22, 2017

Multimodal Machine Learning Applications | Cited by 99

One-line summary

An end-to-end model grounds natural language in 3D environments through gated-attention multimodal fusion and learns its policy with RL and IL. The GA unit outperforms concatenation on multitask and zero-shot generalization.

ABSTRACT

To perform tasks specified by natural language instructions, autonomous agents need to extract semantically meaningful representations of language and map it to visual elements and actions in the environment. This problem is called task-oriented language grounding. We propose an end-to-end trainable neural architecture for task-oriented language grounding in 3D environments which assumes no prior linguistic or perceptual knowledge and requires only raw pixels from the environment and the natural language instruction as input. The proposed model combines the image and text representations using a Gated-Attention mechanism and learns a policy to execute the natural language instruction using standard reinforcement and imitation learning methods. We show the effectiveness of the proposed model on unseen instructions as well as unseen maps, both quantitatively and qualitatively. We also introduce a novel environment based on a 3D game engine to simulate the challenges of task-oriented language grounding over a rich set of instructions and environment states.

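Motivation and objectives

Develop an end-to-end architecture for task-oriented language grounding that takes raw pixels and a natural language instruction as input.
Propose a novel Gated-Attention fusion mechanism that combines visual and language representations.
Train a policy with reinforcement learning and imitation learning to execute instructions in a 3D environment.
Demonstrate generalization to unseen instructions and unseen maps in a Doom-like setting built on ViZDoom.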

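Proposed method

Process the image with a CNN to obtain x_I, and process the instruction with a GRU to obtain x_L.
Fuse the two modalities with a novel Gated-Attention (GA) unit, M_GA(x_I, x_L), which gates the convolutional feature maps with a sigmoid attention vector derived from x_L.
Compare GA fusion against a baseline concatenation fusion, M_concat(x_I, x_L).
Train the policy either with A3C (reinforcement learning), using entropy regularization and Generalized Advantage Estimation, or with imitation learning, using Behavioral Cloning or DAgger.
Use a Doom-based ViZDoom environment with a first-person view and a set of 70 instructions to evaluate multitask and zero-shot generalization.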
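
The two fusion schemes described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the gate comes from a single linear projection of x_L followed by a sigmoid, with one gate per feature map. The weights `W` and `b`, the helper names, and the tensor sizes are all hypothetical. Plain Python lists are used so the sketch runs without dependencies; a real model would use a tensor library.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gate_vector(x_L, W, b):
    """Project the instruction embedding x_L (length k) to one gate per feature map.

    Assumption: a single linear layer (W, b) followed by a sigmoid.
    """
    return [sigmoid(sum(w * x for w, x in zip(row, x_L)) + b_i)
            for row, b_i in zip(W, b)]


def gated_attention(x_I, x_L, W, b):
    """M_GA sketch: scale every H x W feature map of x_I by its own gate."""
    a_L = gate_vector(x_L, W, b)
    return [[[a * v for v in row] for row in fmap]
            for a, fmap in zip(a_L, x_I)]


def concat_fusion(x_I, x_L):
    """M_concat baseline sketch: flatten x_I and append x_L."""
    flat = [v for fmap in x_I for row in fmap for v in row]
    return flat + list(x_L)


# Illustrative sizes only (d feature maps of H x Wd, instruction embedding of length k).
random.seed(0)
d, H, Wd, k = 4, 2, 3, 5
x_I = [[[random.gauss(0, 1) for _ in range(Wd)] for _ in range(H)] for _ in range(d)]
x_L = [random.gauss(0, 1) for _ in range(k)]
W = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(d)]
b = [0.0] * d

m_ga = gated_attention(x_I, x_L, W, b)
m_cat = concat_fusion(x_I, x_L)
print(len(m_ga), len(m_ga[0]), len(m_ga[0][0]))  # GA output keeps the shape of x_I
print(len(m_cat))                                 # concat output has d*H*Wd + k entries
```

In this sketch the instruction acts multiplicatively on whole feature maps, whereas the concatenation baseline only places the two vectors side by side and leaves any interaction to later layers.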

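Experimental results

Research questions

RQ1: Can multimodal fusion with gated attention improve the grounding of natural language to visual elements in 3D environments?
RQ2: Does GA fusion generalize better than concatenation to unseen instructions and unseen maps?
RQ3: How do reinforcement learning and imitation learning compare when used with GA fusion in this task setting?
RQ4: What do the attention maps suggest about attribute and object grounding under different instructions?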

Key findings

Model	Parameters	Easy MT	Easy ZSL	Medium MT	Medium ZSL	Hard MT	Hard ZSL
BC Concat	5.21M	0.86	0.71	0.23	0.15	0.20	0.15
BC GA	5.09M	0.97	0.81	0.30	0.23	0.36	0.29
DAgger Concat	5.21M	0.92	0.73	0.45	0.23	0.19	0.13
DAgger GA	5.09M	0.94	0.85	0.55	0.40	0.29	0.30
A3C Concat	3.44M	1.00	0.80	0.80	0.54	0.24	0.12
A3C GA	3.39M	1.00	0.81	0.89	0.75	0.83	0.73
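
To make the size of the gap concrete, the Concat row can be subtracted from the GA row for each training method. The sketch below is illustrative only. It assumes the six numeric columns are (MT, ZSL) pairs for Easy, Medium, and Hard, which matches the Hard-mode figures quoted in the findings. The names `RESULTS`, `COLUMNS`, and `ga_gain` are hypothetical.

```python
# Success rates copied from the table above.
# Assumed column order: Easy MT, Easy ZSL, Medium MT, Medium ZSL, Hard MT, Hard ZSL.
COLUMNS = ["Easy MT", "Easy ZSL", "Medium MT", "Medium ZSL", "Hard MT", "Hard ZSL"]

RESULTS = {
    "BC": {
        "Concat": [0.86, 0.71, 0.23, 0.15, 0.20, 0.15],
        "GA": [0.97, 0.81, 0.30, 0.23, 0.36, 0.29],
    },
    "DAgger": {
        "Concat": [0.92, 0.73, 0.45, 0.23, 0.19, 0.13],
        "GA": [0.94, 0.85, 0.55, 0.40, 0.29, 0.30],
    },
    "A3C": {
        "Concat": [1.00, 0.80, 0.80, 0.54, 0.24, 0.12],
        "GA": [1.00, 0.81, 0.89, 0.75, 0.83, 0.73],
    },
}


def ga_gain(method):
    """Return GA minus Concat for each column, rounded to two decimals."""
    rows = RESULTS[method]
    return {
        column: round(ga - concat, 2)
        for column, ga, concat in zip(COLUMNS, rows["GA"], rows["Concat"])
    }


for method in RESULTS:
    print(method, ga_gain(method))
```

Under that column assumption, every difference is zero or positive, and the largest gaps appear in the Hard columns for A3C.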

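The GA unit matches or outperforms the concatenation unit on multitask (MT) and zero-shot (ZSL) generalization in every difficulty mode; the one tie is A3C on Easy multitask, where both reach 1.00.
In Hard mode, GA with A3C reaches 83% MT and 73% ZSL, versus 24% MT and 12% ZSL for Concat.
GA models also outperform Concat under imitation learning (BC/DAgger), although in the harder modes exploration affects IL performance.
Attention visualizations show dimension-specific gates that correspond to attributes such as color and object type, suggesting that the instructed attributes are grounded successfully.
In the reported settings, the A3C GA model learns faster and converges to a higher accuracy than A3C Concat.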

This review was generated by AI and checked by a human editor.