QUICK REVIEW

[Paper Review] Speaker-Follower Models for Vision-and-Language Navigation

Daniel Fried, Ronghang Hu|arXiv (Cornell University)|Jun 7, 2018

Speech and dialogue systems | Citations: 244

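One-sentence summary

This paper introduces a speaker-follower framework with a panoramic high-level action space, using speaker-driven data augmentation and pragmatic inference to substantially improve natural-language instruction following and reach state-of-the-art results on the Room-to-Room (R2R) benchmark at the time of publication.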

ABSTRACT

Navigation guided by natural language instructions presents a challenging reasoning problem for instruction followers. Natural language instructions typically identify only a few high-level decisions and landmarks rather than complete low-level motor behaviors; much of the missing information must be inferred based on perceptual context. In machine learning settings, this is doubly challenging: it is difficult to collect enough annotated data to enable learning of this reasoning process from scratch, and also difficult to implement the reasoning process using generic sequence models. Here we describe an approach to vision-and-language navigation that addresses both these issues with an embedded speaker model. We use this speaker model to (1) synthesize new instructions for data augmentation and to (2) implement pragmatic reasoning, which evaluates how well candidate action sequences explain an instruction. Both steps are supported by a panoramic action space that reflects the granularity of human-generated instructions. Experiments show that all three components of this approach---speaker-driven data augmentation, pragmatic reasoning and panoramic action space---dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.

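Motivation and objectives

Address the data scarcity and reasoning challenges in vision-and-language navigation.
Use an embedded speaker model to augment the training data with synthetic instructions.
Incorporate pragmatic reasoning at inference time to select the route that best explains the instruction.
Adopt a panoramic high-level action space to simplify planning and improve generalization.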

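Proposed method

Build a follower model with an attention-based seq2seq architecture that maps instructions to action sequences.
Build a symmetric speaker model that maps routes to instructions, enabling data augmentation with synthetic route-instruction pairs.
Train the follower on speaker-generated augmented data, then fine-tune it on the real data.
At test time, generate K candidate routes with the follower and rescore them with the speaker, performing pragmatic inference in the style of the Rational Speech Acts (RSA) framework: argmax_r P_S(d|r)^λ · P_F(r|d)^(1−λ), where d is the instruction, r is a candidate route, and λ weights the speaker against the follower.
Adopt a panoramic action space, including a Stop action, that uses 360-degree visual input and one-hop attention to inform each decision.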
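
The rescoring step can be sketched in log space, where the product P_S(d|r)^λ · P_F(r|d)^(1−λ) becomes a weighted sum of log-probabilities. This is a minimal illustration, not the authors' code: `Candidate`, `pragmatic_rescore`, the route labels, and the example scores are all hypothetical, and the log-probabilities are assumed to have been computed beforehand by trained speaker and follower models.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    """A candidate route with its scores under both models (hypothetical container)."""
    route: Sequence[str]      # e.g. a list of viewpoint IDs
    log_p_follower: float     # log P_F(r | d), assumed to come from the follower
    log_p_speaker: float      # log P_S(d | r), assumed to come from the speaker


def pragmatic_rescore(candidates: List[Candidate], lam: float) -> Candidate:
    """Return the candidate maximizing lam * log P_S + (1 - lam) * log P_F."""
    if not candidates:
        raise ValueError("need at least one candidate route")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")

    def score(c: Candidate) -> float:
        return lam * c.log_p_speaker + (1.0 - lam) * c.log_p_follower

    return max(candidates, key=score)


# Made-up scores for three candidate routes proposed by the follower.
candidates = [
    Candidate(["a", "b", "c"], log_p_follower=-1.0, log_p_speaker=-9.0),
    Candidate(["a", "d", "e"], log_p_follower=-2.0, log_p_speaker=-3.0),
    Candidate(["a", "f"], log_p_follower=-4.0, log_p_speaker=-5.0),
]

follower_only = pragmatic_rescore(candidates, lam=0.0)  # ignores the speaker
pragmatic = pragmatic_rescore(candidates, lam=0.9)      # arbitrary weight
```

With λ = 0 the follower's own top-scoring route wins; as λ grows, the speaker's estimate of how well each route explains the instruction dominates the choice. The value 0.9 is arbitrary and chosen only so the two settings pick different routes.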

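Experimental results

Research questions

RQ1: Can a speaker model improve data efficiency and generalization in vision-and-language navigation through synthetic instruction data?
RQ2: Does pragmatic inference with the speaker model improve route selection over scoring with the follower alone?
RQ3: Does a panoramic high-level action space improve navigation performance compared with low-level visuomotor control?
RQ4: How much does the combination of data augmentation, pragmatic inference, and the panoramic action space improve generalization to unseen environments?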

Key findings

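Data augmentation	Pragmatic inference	Panoramic space	Validation-Seen NE (m)	Validation-Seen SR (%)	Validation-Seen OSR (%)	Validation-Unseen NE (m)	Validation-Unseen SR (%)	Validation-Unseen OSR (%)
-	-	-	6.08	40.3	51.6	7.90	19.9	26.1
✓	-	-	5.05	46.8	59.9	7.30	24.6	33.2
-	✓	-	5.23	51.5	60.8	6.62	34.5	43.1
-	-	✓	4.86	52.1	63.3	7.07	31.2	41.3
✓	✓	-	4.28	57.2	63.9	5.75	39.3	47.0
✓	-	✓	3.36	66.4	73.8	6.62	35.5	45.0
-	✓	✓	3.88	63.3	71.0	5.24	49.5	63.4
✓	✓	✓	3.08	70.1	78.3	4.83	54.6	65.2

NE: navigation error (lower is better); SR: success rate; OSR: oracle success rate. ✓ = component enabled, - = component disabled.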
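
The three metrics in the table can be computed as in the sketch below. It assumes the usual R2R conventions, which this review does not state: NE is the mean distance in meters from the agent's final position to the goal, SR is the percentage of episodes that end within 3 m of the goal, and OSR counts an episode as successful if any position along the path comes within 3 m. The function name, the input format, and the toy distances are hypothetical.

```python
from typing import List, Tuple

SUCCESS_RADIUS_M = 3.0  # assumed R2R success threshold, not stated in this review


def navigation_metrics(episodes: List[List[float]]) -> Tuple[float, float, float]:
    """Compute (NE, SR, OSR) from per-episode distance-to-goal traces.

    Each episode is a non-empty list of distances to the goal in meters,
    one per visited position, with the final position last.
    """
    if not episodes or any(len(ep) == 0 for ep in episodes):
        raise ValueError("need at least one non-empty episode")
    n = len(episodes)
    # NE: mean distance to the goal at the position where the agent stopped.
    ne = sum(ep[-1] for ep in episodes) / n
    # SR: share of episodes that stop inside the success radius.
    sr = 100.0 * sum(ep[-1] < SUCCESS_RADIUS_M for ep in episodes) / n
    # OSR: share of episodes whose path passes inside the radius at any point.
    osr = 100.0 * sum(min(ep) < SUCCESS_RADIUS_M for ep in episodes) / n
    return ne, sr, osr


# Toy distance traces (meters); not data from the paper.
episodes = [
    [10.0, 6.0, 2.0],  # stops near the goal
    [8.0, 2.5, 5.0],   # passes the goal, then overshoots
    [9.0, 7.0, 6.0],   # never gets close
    [4.0, 1.0],        # stops near the goal
]
ne, sr, osr = navigation_metrics(episodes)
```

The second toy episode shows why OSR is never lower than SR: the agent passes within the radius but stops outside it, so it counts toward OSR only.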

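Speaker-driven data augmentation alone improves validation-seen SR from 40.3% to 46.8% and validation-unseen SR from 19.9% to 24.6%.
Adding pragmatic inference (speaker-based rescoring) to the augmented follower raises SR to 57.2% on validation-seen and 39.3% on validation-unseen, up from 46.8% and 24.6% with augmentation alone.
Adding the panoramic action space on top of both raises SR to 70.1% on validation-seen and 54.6% on validation-unseen, more than double the baseline's unseen SR of 19.9%.
The final model achieves 53.5% SR on the test-unseen environments, well above the prior state of the art.
Overall, all three components (data augmentation, pragmatic inference, and the panoramic action space) contribute meaningfully to performance and generalization.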

This review was generated by AI and checked by a human editor.