QUICK REVIEW

[Paper Review] CMMR-VLN: Vision-and-Language Navigation via Continual Multimodal Memory Retrieval

Haozhou Li, Xiangyu Dong|arXiv (Cornell University)|Mar 9, 2026

Multimodal Machine Learning Applications | Cited by 0

One-line summary

CMMR-VLN adds a continual multimodal memory to vision-and-language navigation, using retrieval-augmented reasoning and reflection to improve zero-shot and real-world performance.

ABSTRACT

Although large language models (LLMs) are introduced into vision-and-language navigation (VLN) to improve instruction comprehension and generalization, existing LLM-based VLN lacks the ability to selectively recall and use relevant priori experiences to help navigation tasks, limiting their performance in long-horizon and unfamiliar scenarios. In this work, we propose CMMR-VLN (Continual Multimodal Memory Retrieval based VLN), a VLN framework that endows LLM agents with structured memory and reflection capabilities. Specifically, the CMMR-VLN constructs a multimodal experience memory indexed by panoramic visual images and salient landmarks to retrieve relevant experiences during navigation, introduces a retrieved-augmented generation pipeline to mimick how experienced human navigators leverage priori knowledge, and incorporates a reflection-based memory update strategy that selectively stores complete successful paths and the key initial mistake in failure cases. Comprehensive tests illustrate average success rate improvements of 52.9%, 20.9% and 20.9%, and 200%, 50% and 50% over the NavGPT, the MapGPT, and the DiscussNav in simulation and real tests, respectively elucidating the great potential of the CMMR-VLN as a backbone VLN framework.

Motivation and objectives

Motivation: improve VLN by letting the agent selectively recall prior multimodal experiences.
Proposes a structured Multimodal Experience Memory (MEM) that stores panoramic views and salient landmarks for retrieval.
Introduces a retrieval-augmented generation pipeline that grounds decisions in retrieved experiences.
Implements a reflection-based memory update that keeps updating the memory from both successes and failures.

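Proposed method

Builds a per-viewpoint Multimodal Experience Memory (MEM) containing panorama information and salient-landmark text, encoded with CLIP and indexed with FAISS.
Uses a Retrieval-Augmented Generation Pipeline (RAGP) that combines instruction and candidate-view embeddings to retrieve relevant past experiences and generate a grounded action plan.
Expresses memory-guided reasoning as an explicit navigation rule R that guides the LLM's analysis, planning, and action steps.
Extends a dynamic semantic topological map as navigation proceeds, to support global route planning.
Applies a reflection module after each episode: it stores complete successful trajectories and the first mistake of failed episodes in the MEM, and applies rules that prune or reinforce entries.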
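The retrieval step over the MEM can be illustrated with a minimal sketch. This is not the authors' implementation: the review states that embeddings come from CLIP and that the index is FAISS, whereas the block below swaps in hand-written toy vectors and a brute-force cosine search so that it runs with the standard library alone. The names `MemoryEntry`, `retrieve`, and the viewpoint IDs are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryEntry:
    viewpoint_id: str        # hypothetical identifier for a stored viewpoint
    landmarks: List[str]     # salient-landmark text kept with the panorama
    embedding: List[float]   # toy stand-in for a CLIP embedding (assumption)
    note: str                # experience text that would be passed to the LLM


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(memory: List[MemoryEntry], query: List[float], k: int = 2) -> List[MemoryEntry]:
    """Return the k entries most similar to the query (brute force, not FAISS)."""
    ranked = sorted(memory, key=lambda e: cosine(e.embedding, query), reverse=True)
    return ranked[:k]


# Toy memory with three viewpoints; the vectors are made up for illustration.
memory = [
    MemoryEntry("vp_kitchen", ["fridge", "counter"], [0.9, 0.1, 0.0], "turn left at the fridge"),
    MemoryEntry("vp_hall", ["staircase"], [0.1, 0.9, 0.1], "go past the staircase"),
    MemoryEntry("vp_bedroom", ["bed", "lamp"], [0.0, 0.2, 0.9], "stop beside the bed"),
]

# Stand-in for the fused instruction / candidate-view embedding.
query = [0.8, 0.2, 0.1]
top_ids = [entry.viewpoint_id for entry in retrieve(memory, query, k=2)]
print(top_ids)  # → ['vp_kitchen', 'vp_hall']
```

In a larger memory, an approximate or exact vector index would replace the `sorted` call; the surrounding logic (embed the query, rank stored experiences, pass the top-k to the prompt) would presumably stay the same.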

Figure 1: The overall CMMR-VLN framework consists of three modules from left to right. The Multimodal Experience Memory (MEM) performs memory building before navigation. The Retrieval-Augmented Generation Pipeline (RAGP) carries out corresponding prompting and action execution at each navigation ste

Experimental results

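Research questions

RQ1 Does continual multimodal memory retrieval improve instruction grounding and long-horizon planning in VLN?
RQ2 Does grounding decisions in retrieved experiences and explicit navigation rules improve navigation metrics over baselines without retrieval?
RQ3 Does reflection-based memory updating enable continual improvement in unseen environments and in the real world?
RQ4 Does integrating a semantic topological map affect global exploration and efficiency in VLN?
RQ5 How does introducing explicit reasoning prompts (navigation rules) affect the performance of LLM-driven VLN?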

Key findings

Method	NE↓	OSR↑	SR↑	SPL↑
NavGPT	6.46	42	34	29
MapGPT	5.63	57	43	34
DiscussNav	5.32	61	43	40
CMMR-VLN(Ours)	5.10	63	52	51

In simulation, CMMR-VLN improves SR by 52.9% relative to NavGPT (34 to 52) and by 20.9% relative to both MapGPT and DiscussNav (43 to 52).
On the R2R validation-unseen setting, CMMR-VLN reaches NE 5.10, OSR 63, SR 52, and SPL 51, outperforming NavGPT, MapGPT, and DiscussNav on all four metrics.
In real-world tests on a TurtleBot 4 Lite, SR improves by 200% over NavGPT, 50% over MapGPT, and 50% over DiscussNav.
Ablations show that removing either the explicit navigation rules or the reflection module lowers performance, underscoring the importance of retrieved-rule grounding and continual memory updates.
A case study shows how retrieved experiences help the agent reach the goal by disambiguating candidate views and reusing prior successes.
The method uses a structured MEM holding panoramic views and landmarks, the RAGP for step-wise retrieval-grounded reasoning, and a reflection module for continual learning.
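The simulation percentages quoted in the abstract are relative gains, and they can be reproduced from the SR column of the results table. The short check below uses only the SR values listed there; the helper name `relative_gain` is ours.

```python
# SR values copied from the results table in this review.
sr = {"NavGPT": 34, "MapGPT": 43, "DiscussNav": 43, "CMMR-VLN": 52}


def relative_gain(ours: float, baseline: float) -> float:
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


gains = {
    name: round(relative_gain(sr["CMMR-VLN"], value), 1)
    for name, value in sr.items()
    if name != "CMMR-VLN"
}
print(gains)  # → {'NavGPT': 52.9, 'MapGPT': 20.9, 'DiscussNav': 20.9}
```

In absolute terms the same gaps are 18 SR points over NavGPT and 9 points over MapGPT and DiscussNav, which is worth keeping in mind when reading the larger relative figures reported for the real-world tests.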

Figure 2: Details of the Reflection Module in Fig 1.
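The storage policy attributed to the reflection module (keep complete successful paths, but only the key initial mistake from a failed episode) might be sketched as follows. This is a sketch under stated assumptions, not the paper's code: the review does not say how the first mistake is identified, so the block assumes a hypothetical per-step `correct` label produced by the reflection stage, and it leaves out the pruning and reinforcement rules. All class and function names are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    viewpoint_id: str
    action: str
    # Hypothetical label assigned during reflection; None means "not judged".
    correct: Optional[bool] = None


@dataclass
class ExperienceMemory:
    successes: List[List[Step]] = field(default_factory=list)  # full trajectories
    mistakes: List[Step] = field(default_factory=list)         # one step per failed episode


def reflect_and_update(memory: ExperienceMemory, trajectory: List[Step], success: bool) -> None:
    """Store the whole path on success, or only the first wrong step on failure."""
    if success:
        memory.successes.append(list(trajectory))
        return
    first_wrong = next((s for s in trajectory if s.correct is False), None)
    if first_wrong is not None:
        memory.mistakes.append(first_wrong)


mem = ExperienceMemory()

# A successful episode: the complete trajectory is kept.
reflect_and_update(mem, [Step("vp_a", "forward", True), Step("vp_b", "stop", True)], success=True)

# A failed episode: only the earliest wrong step is kept, later errors are dropped.
failed = [
    Step("vp_a", "forward", True),
    Step("vp_c", "left", False),
    Step("vp_d", "forward", False),
]
reflect_and_update(mem, failed, success=False)

print(len(mem.successes), [s.viewpoint_id for s in mem.mistakes])  # → 1 ['vp_c']
```

Storing only the earliest error keeps failure entries compact, since steps taken after the first wrong turn mostly reflect that error rather than independent decisions.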

This review was generated by AI and checked by a human editor.