QUICK REVIEW

[Paper Review] VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory

Shaoan Wang, Yuanfei Luo|arXiv (Cornell University)|Jan 13, 2026
Multimodal Machine Learning Applications | Cited by: 0
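One-Sentence Summary

VLingNav is a vision-language-action (VLA) model for embodied navigation that uses adaptive chain-of-thought reasoning and a persistent visual-assisted linguistic memory to achieve state-of-the-art results and zero-shot transfer to the real world.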

ABSTRACT

VLA models have shown promising potential in embodied navigation by unifying perception and planning while inheriting the strong generalization abilities of large VLMs. However, most existing VLA models rely on reactive mappings directly from observations to actions, lacking the explicit reasoning capabilities and persistent memory required for complex, long-horizon navigation tasks. To address these challenges, we propose VLingNav, a VLA model for embodied navigation grounded in linguistic-driven cognition. First, inspired by the dual-process theory of human cognition, we introduce an adaptive chain-of-thought mechanism, which dynamically triggers explicit reasoning only when necessary, enabling the agent to fluidly switch between fast, intuitive execution and slow, deliberate planning. Second, to handle long-horizon spatial dependencies, we develop a visual-assisted linguistic memory module that constructs a persistent, cross-modal semantic memory, enabling the agent to recall past observations to prevent repetitive exploration and infer movement trends for dynamic environments. For the training recipe, we construct Nav-AdaCoT-2.9M, the largest embodied navigation dataset with reasoning annotations to date, enriched with adaptive CoT annotations that induce a reasoning paradigm capable of adjusting both when to think and what to think about. Moreover, we incorporate an online expert-guided reinforcement learning stage, enabling the model to surpass pure imitation learning and to acquire more robust, self-explored navigation behaviors. Extensive experiments demonstrate that VLingNav achieves state-of-the-art performance across a wide range of embodied navigation benchmarks. Notably, VLingNav transfers to real-world robotic platforms in a zero-shot manner, executing various navigation tasks and demonstrating strong cross-domain and cross-task generalization.

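Motivation and Objectives

  • Advance embodied navigation through explicit, adaptive reasoning grounded in linguistic representations and a persistent cross-modal memory.
  • Develop AdaCoT, which triggers reasoning dynamically to balance speed against deliberation.
  • Introduce VLingMem, a persistent cross-modal (visual-linguistic) memory that supports long-horizon tasks.
  • Create Nav-AdaCoT-2.9M, a large-scale embodied navigation dataset with adaptive CoT annotations for supervised training.
  • Apply online expert-guided reinforcement learning to improve robustness beyond imitation learning.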

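Proposed Method

  • Propose adaptive chain-of-thought reasoning (AdaCoT), which switches between fast execution and slow planning based on task complexity.
  • Develop a visual-assisted linguistic memory (VLingMem) that stores and recalls cross-modal semantic memories for long-horizon navigation.
  • Extend a video-based vision-language model (LLaVA-Video-7B) with an action model that converts VLM outputs into continuous robot trajectories.
  • Construct Nav-AdaCoT-2.9M, the largest embodied navigation dataset to date with reasoning annotations and adaptive CoT labels.
  • Pre-train on open-world adaptive CoT video data, run supervised fine-tuning with imitation learning, and apply online expert-guided RL as post-training.
  • Output continuous robot actions through an online, probabilistic continuous action head, enabling end-to-end policy learning.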
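
To illustrate what a probabilistic continuous action head can look like, here is a minimal sketch in plain Python. It assumes a diagonal Gaussian over a three-dimensional action (for example a waypoint offset and heading change); the class name `GaussianActionHead`, the linear weights, and the action layout are all hypothetical and are not taken from the paper, which does not specify the head's form in this summary. The log-probability method is included because policy-gradient style RL typically needs it.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class GaussianActionHead:
    """Hypothetical diagonal-Gaussian head: features -> (mean, log_std) per action dim."""
    w_mean: list      # shape [action_dim][feature_dim]
    w_log_std: list   # shape [action_dim][feature_dim]

    def params(self, features):
        # Linear projection of the feature vector to a mean per action dimension.
        mean = [sum(w * f for w, f in zip(row, features)) for row in self.w_mean]
        # Clamp log-std to [-5, 2] so sampling stays numerically well behaved.
        log_std = [
            max(-5.0, min(2.0, sum(w * f for w, f in zip(row, features))))
            for row in self.w_log_std
        ]
        return mean, log_std

    def sample(self, features, rng):
        # Draw one continuous action from the predicted distribution.
        mean, log_std = self.params(features)
        return [rng.gauss(m, math.exp(s)) for m, s in zip(mean, log_std)]

    def log_prob(self, features, action):
        # Sum of per-dimension Gaussian log densities.
        mean, log_std = self.params(features)
        total = 0.0
        for a, m, s in zip(action, mean, log_std):
            z = (a - m) / math.exp(s)
            total += -0.5 * z * z - s - 0.5 * math.log(2 * math.pi)
        return total


# Toy usage with made-up weights and a made-up 2-D feature vector.
head = GaussianActionHead(
    w_mean=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    w_log_std=[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
)
features = [0.2, -0.1]
rng = random.Random(0)
action = head.sample(features, rng)
logp = head.log_prob(features, action)
```

Sampling from a distribution rather than emitting a single point estimate is one common way to give an RL stage something to explore with; whether VLingNav uses a Gaussian or another family is not stated here.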
Figure 1 : Overview of VLingNav. VLingNav is a VLA model enhanced with adaptive CoT reasoning and visual-assisted linguistic memory. This architecture allows the model to leverage historical visual and linguistic memory, achieving SOTA results on several embodied navigation benchmarks. Furthermore,

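Experimental Results

Research Questions

  • RQ1: How does adaptive reasoning improve efficiency and success rate in long-horizon embodied navigation tasks?
  • RQ2: Does persistent linguistic memory aid recall and reduce repeated exploration in dynamic environments?
  • RQ3: Does combining adaptive CoT with visual-assisted memory achieve state-of-the-art results on VLN, ObjectNav, and ImageNav tasks?
  • RQ4: Does online expert-guided reinforcement learning further improve navigation robustness beyond imitation learning?
  • RQ5: Is zero-shot transfer to real-world robots feasible with a linguistic-driven cognition framework?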

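Key Findings

  • VLingNav achieves state-of-the-art performance on standard embodied navigation benchmarks.
  • AdaCoT switches dynamically between fast execution and deliberate reasoning as the situation requires.
  • VLingMem provides a persistent cross-modal memory that reduces redundant exploration and helps infer movement trends.
  • Nav-AdaCoT-2.9M provides large-scale, reasoning-annotated data for supervised training.
  • Online expert-guided RL post-training improves navigation robustness beyond imitation learning.
  • Zero-shot transfer to real-world robots demonstrates cross-domain and cross-task generalization.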
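
The interplay between AdaCoT and VLingMem in the findings above can be pictured with a small control-loop sketch. This is an illustration under stated assumptions, not the paper's implementation: in VLingNav the model itself decides when to reason, whereas here the gate, the reasoner, and the policy are separate toy callables, and `LinguisticMemory`, `navigation_step`, and the observation fields are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class LinguisticMemory:
    """Hypothetical text memory holding short summaries of past reasoning."""
    max_entries: int = 8  # assumed to be >= 1
    entries: list = field(default_factory=list)

    def write(self, step, summary):
        self.entries.append((step, summary))
        # Keep only the most recent entries so the recalled context stays bounded.
        if len(self.entries) > self.max_entries:
            self.entries = self.entries[-self.max_entries:]

    def recall(self):
        return "\n".join(f"[t={t}] {s}" for t, s in self.entries)


def navigation_step(step, observation, memory, should_think, reason, act):
    """One control step: reason only when the gate fires, then act."""
    context = memory.recall()
    thought = None
    if should_think(observation, context):
        thought = reason(observation, context)  # slow, deliberate path
        memory.write(step, thought)             # persist a linguistic summary
        context = memory.recall()
    return act(observation, context, thought), thought


# Toy stand-ins for the learned components (assumptions for illustration only).
def toy_gate(obs, context):
    return obs["ambiguous"]


def toy_reason(obs, context):
    return f"At {obs['view']}: unexplored branch on the left"


def toy_act(obs, context, thought):
    return "turn_left" if thought else "forward"


observations = [
    {"view": "hallway", "ambiguous": False},
    {"view": "junction", "ambiguous": True},
    {"view": "hallway", "ambiguous": False},
]
memory = LinguisticMemory(max_entries=4)
trace = []
for t, obs in enumerate(observations):
    action, thought = navigation_step(t, obs, memory, toy_gate, toy_reason, toy_act)
    trace.append((action, thought is not None))
```

In this toy run only the junction step takes the slow path, and its summary remains available as context for later steps, which is the kind of recall that could help an agent avoid re-exploring a branch it has already reasoned about.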
Figure 2 : The overall framework of VLingNav. The framework takes video streams and multimodal instruction as input to produce robot action for navigation with tailored linguistic designs. AdaCoT can adaptively generate linguistic thinking according to its observation, while VLingMem summarizes CoT


This review was generated by AI and checked by a human editor.