QUICK REVIEW

[Paper Review] How To Not Train Your Dragon: Training-free Embodied Object Goal Navigation with Semantic Frontiers

Junting Chen, Guohao Li | arXiv (Cornell University) | May 26, 2023

Multimodal Machine Learning Applications | Cited by 2

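One-Sentence Summary

This paper proposes StructNav, a training-free, modular method for embodied object goal navigation that uses visual SLAM, a semantic point cloud, and a spatial scene graph to perform semantics-aware frontier exploration. By injecting language and scene priors into geometric frontiers, the method achieves state-of-the-art performance on the Gibson benchmark without end-to-end training, outperforming earlier learning-heavy methods while highlighting semantic segmentation as the key bottleneck.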

ABSTRACT

Object goal navigation is an important problem in Embodied AI that involves guiding the agent to navigate to an instance of the object category in an unknown environment -- typically an indoor scene. Unfortunately, current state-of-the-art methods for this problem rely heavily on data-driven approaches, e.g., end-to-end reinforcement learning, imitation learning, and others. Moreover, such methods are typically costly to train and difficult to debug, leading to a lack of transferability and explainability. Inspired by recent successes in combining classical and learning methods, we present a modular and training-free solution, which embraces more classic approaches, to tackle the object goal navigation problem. Our method builds a structured scene representation based on the classic visual simultaneous localization and mapping (V-SLAM) framework. We then inject semantics into geometric-based frontier exploration to reason about promising areas to search for a goal object. Our structured scene representation comprises a 2D occupancy map, semantic point cloud, and spatial scene graph. Our method propagates semantics on the scene graphs based on language priors and scene statistics to introduce semantic knowledge to the geometric frontiers. With injected semantic priors, the agent can reason about the most promising frontier to explore. The proposed pipeline shows strong experimental performance for object goal navigation on the Gibson benchmark dataset, outperforming the previous state-of-the-art. We also perform comprehensive ablation studies to identify the current bottleneck in the object navigation task.

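Research Motivation and Goals

Address the limitations of deep-learning-based embodied object goal navigation methods: heavy data requirements, difficulty of debugging, and poor transferability.
Develop a modular, training-free pipeline that combines classical SLAM with semantic reasoning to improve explainability and real-world deployability.
Investigate whether semantic priors can effectively guide frontier exploration in unseen environments without reinforcement learning.
Identify and analyze the bottlenecks of current semantics-aware pipelines for object navigation.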

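Proposed Method

Build a structured scene representation with visual SLAM, producing a 2D occupancy map, a semantic point cloud, and a spatial scene graph.
Inject language priors from a pretrained language model and scene priors from training-data statistics into the geometric frontiers via the spatial scene graph.
Introduce a semantic frontier (SemFrontier) module that scores unexplored frontiers with semantic knowledge, so the agent can reason about which frontier is most promising to explore.
Propagate semantic information through the environment using the semantic point cloud and scene graph, improving object search efficiency.
After a promising frontier is selected, use a fast marching path planner for point-to-point navigation.
Avoid end-to-end training by relying on modular components: SLAM for geometric modeling, semantic segmentation for label generation, and rule-based reasoning for exploration decisions.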
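
The frontier steps above can be illustrated with a minimal sketch. It is not StructNav's implementation: this review does not give the exact scoring rule, so the function names (`find_frontier_cells`, `score_frontier`, `select_frontier`), the prior table, the search radius, and the distance discount are all assumptions made for illustration. The sketch finds geometric frontiers on a 2D occupancy grid (free cells bordering unknown cells), then ranks them by how strongly nearby observed objects are associated with the goal category.

```python
import numpy as np

# Cell codes for the occupancy grid (an assumed convention).
FREE, OCCUPIED, UNKNOWN = 0, 1, -1


def find_frontier_cells(grid):
    """Return (row, col) indices of free cells with an unknown 4-neighbour."""
    free = grid == FREE
    unknown = grid == UNKNOWN
    padded = np.pad(unknown, 1, constant_values=False)
    near_unknown = (
        padded[:-2, 1:-1]    # neighbour above
        | padded[2:, 1:-1]   # neighbour below
        | padded[1:-1, :-2]  # neighbour to the left
        | padded[1:-1, 2:]   # neighbour to the right
    )
    return np.argwhere(free & near_unknown)


def score_frontier(frontier, objects, prior, goal, radius=3.0):
    """Sum the prior relevance of observed objects near a frontier.

    `prior[(label, goal)]` is a hypothetical relevance value, standing in
    for language priors or scene statistics. The 1 / (1 + d) discount and
    the radius cut-off are illustrative choices, not taken from the paper.
    """
    frontier = np.asarray(frontier, dtype=float)
    score = 0.0
    for label, position in objects:
        d = float(np.linalg.norm(np.asarray(position, dtype=float) - frontier))
        if d <= radius:
            score += prior.get((label, goal), 0.0) / (1.0 + d)
    return score


def select_frontier(grid, objects, prior, goal):
    """Pick the frontier cell with the highest semantic score, or None."""
    cells = [tuple(int(v) for v in cell) for cell in find_frontier_cells(grid)]
    if not cells:
        return None
    return max(cells, key=lambda c: score_frontier(c, objects, prior, goal))


if __name__ == "__main__":
    grid = np.array([
        [FREE, FREE, UNKNOWN],
        [FREE, OCCUPIED, UNKNOWN],
        [FREE, FREE, FREE],
    ])
    objects = [("bed", (0, 0)), ("sink", (2, 1))]  # (label, grid position)
    prior = {("bed", "pillow"): 0.9, ("sink", "pillow"): 0.05}
    print(select_frontier(grid, objects, prior, "pillow"))
```

In this toy grid the frontier next to the bed wins when the goal is a pillow, which is the intuition behind semantic frontiers: geometry proposes where to go, semantics decides which proposal to try first. A real pipeline would work on clustered frontier regions and scene-graph nodes rather than single grid cells.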

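Experimental Results

Research Questions

RQ1: Can a training-free, modular method outperform end-to-end learning methods on object goal navigation?
RQ2: How effective are language and scene priors at guiding frontier exploration when combined with geometric frontiers?
RQ3: How does semantic segmentation quality affect overall navigation performance?
RQ4: Can a structured scene representation improve generalization and explainability in embodied AI?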

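Key Findings

StructNav reaches a success rate of 84.2% on the Gibson benchmark, outperforming previous state-of-the-art methods that require extensive training.
The method reaches an SPL (Success weighted by Path Length) of 0.563, significantly better than the previous state of the art, indicating both a high success rate and efficient navigation.
Ablations show that when the semantic segmentation error rate reaches 50%, performance drops below half of the ground-truth-label baseline on all metrics.
Randomly dropping 50% of segmentation labels has minimal effect on performance, suggesting that temporal integration in the SLAM pipeline effectively mitigates label noise.
The semantic segmentation model is identified as the main bottleneck: performance falls sharply as segmentation errors increase.
Using language priors and scene statistics enables effective semantic reasoning without training, supporting the feasibility of hybrid classical-learning approaches.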
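
For readers unfamiliar with the SPL metric reported above, the following sketch computes it under the standard definition: each successful episode contributes the ratio of the shortest-path length to the longer of the actual path and the shortest path, failed episodes contribute zero, and the result is averaged over all episodes. The function name `spl` and the example numbers are illustrative assumptions, not values from the paper.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes        : iterable of 0/1 (or bool) success flags
    shortest_lengths : shortest-path length from start to goal, assumed > 0
    path_lengths     : length of the path the agent actually travelled
    """
    episodes = list(zip(successes, shortest_lengths, path_lengths))
    if not episodes:
        raise ValueError("at least one episode is required")
    total = 0.0
    for success, shortest, travelled in episodes:
        if success:
            # A path twice as long as the optimum earns only half credit.
            total += shortest / max(travelled, shortest)
    return total / len(episodes)


if __name__ == "__main__":
    # Three hypothetical episodes: an inefficient success, a failure,
    # and an optimal success.
    print(spl([1, 0, 1], [5.0, 4.0, 2.0], [10.0, 4.0, 2.0]))
```

Because failures score zero and detours are penalized, SPL can never exceed the success rate, which is why the two numbers are usually reported together.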

This review was generated by AI and reviewed by human editors.