QUICK REVIEW

[Paper Review] Self-Monitoring Navigation Agent via Auxiliary Progress Estimation

Chih‐Yao Ma, Jiasen Lu|arXiv (Cornell University)|Jan 10, 2019

Robotic Path Planning Algorithms | References: 51 | Citations: 134

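One-Sentence Summary

This paper proposes a self-monitoring Vision-and-Language Navigation (VLN) agent that combines visual-textual co-grounding with a progress monitor. It achieves state-of-the-art results on Room-to-Room (R2R), including an 8% absolute increase in success rate on the unseen test set.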

ABSTRACT

The Vision-and-Language Navigation (VLN) task entails an agent following navigational instruction in photo-realistic unknown environments. This challenging task demands that the agent be aware of which instruction was completed, which instruction is needed next, which way to go, and its navigation progress towards the goal. In this paper, we introduce a self-monitoring agent with two complementary components: (1) visual-textual co-grounding module to locate the instruction completed in the past, the instruction required for the next action, and the next moving direction from surrounding images and (2) progress monitor to ensure the grounded instruction correctly reflects the navigation progress. We test our self-monitoring agent on a standard benchmark and analyze our proposed approach through a series of ablation studies that elucidate the contributions of the primary components. Using our proposed method, we set the new state of the art by a significant margin (8% absolute increase in success rate on the unseen test set). Code is available at https://github.com/chihyaoma/selfmonitoring-agent .

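Motivation and Objectives

Motivate and frame the VLN problem: without explicit goal maps, the agent must know which instructions it has completed and what to do next.
Develop a visual-textual co-grounding module that grounds the current action in the past and upcoming instructions and in the surrounding images.
Introduce a progress monitor that estimates how much of the instruction has been completed and how far the agent has progressed toward the goal, in order to regularize grounding.
Integrate grounding and the progress signal into action selection and beam-search inference to improve navigation performance.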

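Proposed Method

Proposes a two-component agent, visual-textual co-grounding plus a progress monitor, that grounds visual input and instructions jointly and estimates progress.
Adopts an LSTM-based sequence-to-sequence architecture with attention that computes grounded textual and visual features at every step.
Computes textual grounding with soft attention over the instruction words combined with positional encoding, and grounds visual information with attention over panoramic view features.
Selects an action by combining the grounded instruction with the visual context, scoring the navigable directions by inner product, and applying a softmax over them.
Introduces a progress monitor that computes a progress signal p_t^{pm} from the history, the grounded visuals, and the textual attention, and uses it to regularize training.
Trains with a joint loss that combines cross-entropy for action selection with a regression term for progress estimation; at inference, uses beam search with the progress signals integrated into beam scoring.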
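
The attention, action-selection, and joint-loss steps listed above can be illustrated with a small pure-Python sketch. It is a simplified illustration under stated assumptions, not the authors' implementation: the helper names (`soft_attend`, `select_action`, `joint_loss`), the mixing weight `lam`, and the toy feature vectors are all hypothetical, and the real model uses learned LSTM states and projections rather than raw inner products.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def soft_attend(query, features):
    """Soft attention: inner-product scores -> softmax weights -> weighted sum."""
    weights = softmax([dot(query, f) for f in features])
    dim = len(features[0])
    context = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return weights, context


def select_action(query, direction_features):
    """Score each navigable direction by inner product, then apply a softmax."""
    return softmax([dot(query, d) for d in direction_features])


def joint_loss(action_probs, target_action, progress_pred, progress_target, lam=0.5):
    """Weighted sum of action cross-entropy and squared progress error.

    `lam` is a hypothetical mixing weight, not a value taken from this review.
    """
    cross_entropy = -math.log(action_probs[target_action])
    progress_error = (progress_target - progress_pred) ** 2
    return lam * cross_entropy + (1.0 - lam) * progress_error


# Toy example (made-up numbers): a 2-d state, three instruction words, two directions.
hidden = [1.0, 0.0]
word_features = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
text_weights, grounded_text = soft_attend(hidden, word_features)
action_probs = select_action(grounded_text, [[1.0, 0.0], [0.0, 1.0]])
loss = joint_loss(action_probs, target_action=0, progress_pred=0.4, progress_target=0.5)
```

In this toy setup the first instruction word receives the largest attention weight, and the first direction gets the higher action probability because it aligns with the grounded text vector.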

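Experimental Results

Research Questions

RQ1: How can grounding be performed jointly across the visual and textual modalities to determine which instructions are completed and which is needed next?
RQ2: Can a progress estimation module regularize grounding and improve navigation progress toward the goal in the VLN task?
RQ3: Does integrating progress signals into beam search improve generalization to unseen environments?
RQ4: How do co-grounding and progress monitoring each contribute to state-of-the-art performance on R2R?
RQ5: With and without data augmentation, is data efficiency improved over prior work?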

Key Findings

NE: navigation error in meters (lower is better); SR: success rate; OSR: oracle success rate; SPL: success weighted by path length (higher is better for all three). "-" marks a value that is not reported.

Method	NE (Validation-Seen)	SR (Validation-Seen)	OSR (Validation-Seen)	SPL (Validation-Seen)	NE (Validation-Unseen)	SR (Validation-Unseen)	OSR (Validation-Unseen)	SPL (Validation-Unseen)	NE (Test Unseen)	SR (Test Unseen)	OSR (Test Unseen)	SPL (Test Unseen)
Random	9.45	0.16	0.21	-	9.23	0.16	0.22	-	9.77	0.13	0.18	-
Student-forcing	6.01	0.39	0.53	-	7.81	0.22	0.28	-	7.85	0.20	0.27	-
RPA	5.56	0.43	0.53	-	7.65	0.25	0.32	-	7.53	0.25	0.33	-
Speaker-Follower	3.88	0.63	0.71	-	5.24	0.50	0.63	-	-	-	-	-
Speaker-Follower* (leaderboard)	3.08	0.70	0.78	-	4.83	0.55	0.65	-	4.87	0.53	0.64	-
Ours (beam search) (leaderboard)	3.23	0.70	0.78	0.66	5.04	0.57	0.70	0.51	4.99	0.57	0.68	0.51
-	-	-	-	-	-	-	-	-	4.99	0.57	0.95	0.02
Ours* (beam search) (leaderboard)	3.04	0.71	0.78	0.67	4.62	0.58	0.68	0.52	4.48	0.61	0.70	0.56
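
As a rough guide to how the four metrics in the table are computed, the sketch below assumes the usual R2R conventions: an episode counts as a success when the agent stops within 3 m of the goal, and SPL scales each success by the ratio of the shortest-path length to the length actually traveled. The function name `evaluate` and the episode dictionary keys are hypothetical, and the two episodes are made-up numbers, not results from the paper.

```python
def evaluate(episodes, success_radius=3.0):
    """Return mean NE, SR, OSR and SPL over a list of episodes.

    Each episode is a dict with hypothetical keys:
      final_dist   - distance (m) from the stopping point to the goal
      min_dist     - closest distance (m) to the goal anywhere along the path
      path_len     - length (m) of the path the agent actually took
      shortest_len - length (m) of the shortest path from start to goal
    """
    n = len(episodes)
    ne = sr = osr = spl = 0.0
    for ep in episodes:
        success = 1.0 if ep["final_dist"] < success_radius else 0.0
        oracle = 1.0 if ep["min_dist"] < success_radius else 0.0
        # Efficiency is 1.0 for a shortest path and shrinks as the path grows.
        efficiency = ep["shortest_len"] / max(ep["path_len"], ep["shortest_len"])
        ne += ep["final_dist"]
        sr += success
        osr += oracle
        spl += success * efficiency
    return {"NE": ne / n, "SR": sr / n, "OSR": osr / n, "SPL": spl / n}


# Made-up episodes: one efficient success, one failure that passed near the goal.
episodes = [
    {"final_dist": 1.0, "min_dist": 1.0, "path_len": 12.0, "shortest_len": 10.0},
    {"final_dist": 5.0, "min_dist": 2.0, "path_len": 20.0, "shortest_len": 10.0},
]
metrics = evaluate(episodes)
```

Because SPL divides by the traveled length, a method that explores many long paths before stopping can keep the same SR while its SPL falls sharply.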

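Achieves state-of-the-art results on both the seen and unseen R2R splits, with an absolute improvement of 8 percentage points in SR on the unseen test set.
The co-grounding framework (visual and textual) substantially outperforms the baseline by using a hidden state shared across both modalities.
Regularization from the progress monitor improves SR in both seen and unseen environments and is key to surpassing prior work even without data augmentation.
Beam search that incorporates progress estimation yields additional gains, particularly in unseen environments.
The textual-grounding attention moves diagonally across the instruction over time, suggesting that instructions are being grounded to actions effectively.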
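
One way to picture progress-aware beam search is the sketch below. The scoring rule is an assumption made for illustration (cumulative log action probability plus the log of the current progress estimate), not necessarily the rule used in the paper. The names `beam_step` and `toy_expand` are hypothetical, and progress estimates are assumed to lie in (0, 1].

```python
import math


def beam_step(beams, expand, beam_size=2):
    """Advance every partial path by one action and keep the best `beam_size`.

    `beams` is a list of (path, logprob, progress) triples. `expand(path)` is a
    hypothetical callback returning (action, action_prob, progress) triples,
    with progress assumed to lie in (0, 1] so that its logarithm is defined.
    """
    candidates = []
    for path, logprob, _ in beams:
        for action, action_prob, progress in expand(path):
            candidates.append((path + [action], logprob + math.log(action_prob), progress))
    # Assumed ranking rule: cumulative action log-probability plus log progress.
    candidates.sort(key=lambda c: c[1] + math.log(c[2]), reverse=True)
    return candidates[:beam_size]


def toy_expand(path):
    """Made-up expansion: "left" is likelier, "right" is judged to make more progress."""
    return [("left", 0.6, 0.2), ("right", 0.4, 0.9)]


beams = beam_step([([], 0.0, 1.0)], toy_expand, beam_size=1)
best_path = beams[0][0]
```

In the toy call, the action with the lower probability still wins because its progress estimate is much higher, which is the kind of re-ranking a progress signal makes possible.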


This review was generated by AI and checked by a human editor.