QUICK REVIEW

[Paper Digest] SpatiaLoc: Leveraging Multi-Level Spatial Enhanced Descriptors for Cross-Modal Localization

Tianyi Shang, Pengjie Xu|arXiv (Cornell University)|Jan 7, 2026

Multimodal Machine Learning Applications | Cited by 0

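One-Sentence Summary

SpatiaLoc introduces a coarse-to-fine cross-modal localization framework that combines Bezier-enhanced spatial encoding and frequency-domain features with uncertainty-aware 2D localization, outperforming state-of-the-art methods on KITTI360Pose.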

ABSTRACT

Cross-modal localization using text and point clouds enables robots to localize themselves via natural language descriptions, with applications in autonomous navigation and interaction between humans and robots. In this task, objects often recur across text and point clouds, making spatial relationships the most discriminative cues for localization. Given this characteristic, we present SpatiaLoc, a framework utilizing a coarse-to-fine strategy that emphasizes spatial relationships at both the instance and global levels. In the coarse stage, we introduce a Bezier Enhanced Object Spatial Encoder (BEOSE) that models spatial relationships at the instance level using quadratic Bezier curves. Additionally, a Frequency Aware Encoder (FAE) generates spatial representations in the frequency domain at the global level. In the fine stage, an Uncertainty Aware Gaussian Fine Localizer (UGFL) regresses 2D positions by modeling predictions as Gaussian distributions with a loss function aware of uncertainty. Extensive experiments on KITTI360Pose demonstrate that SpatiaLoc significantly outperforms existing state-of-the-art (SOTA) methods.

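Motivation and Goals

Motivate cross-modal localization from natural language descriptions and city-scale point cloud maps, a setting in which the same objects recur at different locations.
Propose a coarse-to-fine framework that aligns text and point clouds by exploiting spatial relationships at both the instance and global levels.
Introduce dedicated modules (BEOSE, FAE, UGFL) that model spatial cues and uncertainty for robust localization.
Demonstrate significant empirical gains over existing SOTA methods on KITTI360Pose.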

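Proposed Method

In the coarse stage, the Bezier Enhanced Object Spatial Encoder (BEOSE) refines instance-level spatial relationships using quadratic Bezier curves.
At the global level of the coarse stage, the Frequency Aware Encoder (FAE) projects submap features into the frequency domain to obtain robust global descriptors.
In the fine stage, the Uncertainty Aware Gaussian Fine Localizer (UGFL) regresses 2D positions as Gaussian distributions, using an uncertainty-aware loss and cross-modal recurrent fusion.
Relative spatial graph construction fuses visual features with spatial offsets to form edge representations for the visual and text modalities.
Gaussian Aggregation (GA) compresses pairwise edge features into node-level descriptors through probabilistic (reparameterized) aggregation.
Coarse-stage cross-modal alignment is optimized with a combination of global, instance-level, and object-level losses to improve retrieval and discriminability.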
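
Two of the coarse-stage ideas listed above can be illustrated with a minimal sketch. This digest does not say how BEOSE places its control points or how FAE is built, so the code below shows only the standard quadratic Bezier formula and a plain FFT magnitude spectrum. The function names, the choice of control point, and the use of the magnitude spectrum are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def quadratic_bezier(p0, p1, p2, num_samples=8):
    """Sample B(t) = (1-t)^2 p0 + 2(1-t)t p1 + t^2 p2 for t in [0, 1].

    p0 and p2 are the endpoints (e.g. two object centers); p1 is the
    control point. How the control point is chosen is an assumption here.
    """
    p0, p1, p2 = (np.asarray(p, dtype=float) for p in (p0, p1, p2))
    t = np.linspace(0.0, 1.0, num_samples)[:, None]  # column of t values
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2

def frequency_descriptor(features):
    """Magnitude spectrum of per-object features along the object axis.

    Hypothetical stand-in for a frequency-domain global descriptor:
    `features` has shape (num_objects, feature_dim).
    """
    spectrum = np.fft.rfft(np.asarray(features, dtype=float), axis=0)
    return np.abs(spectrum)

# Curve between two object centers, bent toward a control point.
curve = quadratic_bezier([0.0, 0.0], [2.0, 2.0], [4.0, 0.0], num_samples=5)
print(curve[2])  # midpoint of the curve: [2. 1.]
```

The magnitude spectrum is unchanged when the rows of `features` are circularly shifted, which is one simple way a frequency-domain descriptor can be less sensitive to how objects are ordered in a submap.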

Figure 1: The overall architecture of the proposed SpatiaLoc. The left panel illustrates the coarse stage, which utilizes the BEOSE for instance-level spatial alignment and the FAE to extract frequency-domain spatial geometric features for global-level alignment. The right panel depicts the Fine Sta

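Experimental Results

Research Questions

RQ1: Does explicitly modeling the relative spatial relationships among recurring objects improve text-to-point-cloud localization?
RQ2: Do instance-level Bezier-encoded spatial cues and global frequency-domain features improve coarse-stage submap retrieval compared with existing SOTA methods?
RQ3: Does uncertainty-aware Gaussian modeling in the fine stage improve robust 2D localization under cross-modal ambiguity?
RQ4: How do multi-level (instance and global) spatial representations interact to improve cross-modal alignment?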

Key Findings

Methods	Validation k=1	Validation k=3	Validation k=5	Test k=1	Test k=3	Test k=5
Text2Pos	0.14	0.28	0.37	0.12	0.25	0.33
RET	0.18	0.34	0.44	0.15	0.29	0.37
Text2Loc	0.31	0.54	0.64	0.28	0.49	0.58
IFRP-T2P	0.24	0.46	0.57	0.23	0.39	0.48
MambaPlace	0.35	0.61	0.72	0.31	0.53	0.62
CMMLoc	0.35	0.61	0.73	0.32	0.53	0.63
PMSH	0.37	0.63	0.73	0.34	0.56	0.65
SpatiaLoc (Global)	0.51	0.71	0.71?	0.?	0.??	0.??
SpatiaLoc (coarse-to-fine)	0.54	0.77	0.82	0.51	0.71	0.74
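
The k=1, k=3, and k=5 columns in the table above appear to report top-k retrieval recall. A minimal sketch of how such a metric is commonly computed is given below. It assumes exactly one correct submap per text query and a precomputed similarity matrix; the function name and the toy data are hypothetical, and the benchmark's actual evaluation protocol may differ.

```python
import numpy as np

def recall_at_k(similarity, ground_truth, ks=(1, 3, 5)):
    """Fraction of queries whose correct submap is among the top-k results.

    similarity:   (num_queries, num_submaps) scores, higher means more similar.
    ground_truth: (num_queries,) index of the single correct submap per query
                  (an assumption made for this sketch).
    """
    similarity = np.asarray(similarity, dtype=float)
    ground_truth = np.asarray(ground_truth)
    # Rank submaps for each query from most to least similar.
    ranking = np.argsort(-similarity, axis=1)
    results = {}
    for k in ks:
        hits = (ranking[:, :k] == ground_truth[:, None]).any(axis=1)
        results[k] = float(hits.mean())
    return results

# Toy example: 3 text queries scored against 3 submaps.
toy_similarity = [
    [0.9, 0.1, 0.3],  # correct submap 0 is ranked first
    [0.2, 0.5, 0.8],  # correct submap 1 is ranked second
    [0.4, 0.6, 0.5],  # correct submap 0 is ranked third
]
toy_recall = recall_at_k(toy_similarity, [0, 1, 0], ks=(1, 2, 3))
```

In the toy example one of the three queries is correct at k=1, two at k=2, and all three at k=3, so recall rises with k in the same way the table's columns do.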

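SpatiaLoc (coarse-to-fine) achieves state-of-the-art recall in both the coarse and fine stages on KITTI360Pose, with significant gains on the challenging test set.
In coarse-stage retrieval, SpatiaLoc improves recall (for example, 0.48 at k=1 on the test set versus 0.34 for PMSH) and achieves strong gains at higher k (0.80 at k=5).
The Frequency Aware Encoder (FAE) provides robust global descriptors in the frequency domain, enabling strong coarse-stage retrieval even when only global features are used.
BEOSE substantially improves performance; removing it lowers Recall@1 by about 9 percentage points.
GA and the uncertainty-aware UGFL together drive robust fusion and regression in the fine stage; ablations show a considerable drop when they are removed.
Overall, SpatiaLoc consistently outperforms prior SOTA methods in both submap retrieval and fine localization, supporting the effectiveness of the coarse-to-fine, multi-level spatial strategy.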
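
To make the idea of uncertainty-aware Gaussian regression concrete, here is a minimal sketch of a standard heteroscedastic Gaussian negative log-likelihood for 2D positions. The digest does not give UGFL's exact loss, so the diagonal covariance, the log-variance parameterization, and the function name are assumptions, not the authors' formulation.

```python
import numpy as np

def gaussian_nll_2d(pred_mean, pred_log_var, target):
    """Mean negative log-likelihood of 2D targets under diagonal Gaussians.

    pred_mean, pred_log_var, target: arrays of shape (batch, 2).
    The constant log(2*pi) term is dropped because it does not affect
    optimization. A diagonal covariance is assumed for simplicity.
    """
    pred_mean = np.asarray(pred_mean, dtype=float)
    pred_log_var = np.asarray(pred_log_var, dtype=float)
    target = np.asarray(target, dtype=float)
    sq_err = (target - pred_mean) ** 2
    # Per-coordinate term: 0.5 * (log sigma^2 + error^2 / sigma^2).
    nll = 0.5 * (pred_log_var + sq_err / np.exp(pred_log_var))
    return float(nll.sum(axis=-1).mean())

# Same 3 m error per axis, with a confident and an uncertain prediction.
target = [[3.0, 3.0]]
loss_confident = gaussian_nll_2d([[0.0, 0.0]], [[0.0, 0.0]], target)
loss_uncertain = gaussian_nll_2d([[0.0, 0.0]], np.log([[9.0, 9.0]]), target)
```

With this kind of loss, a large error is penalized less when the model also predicts a large variance, while the log-variance term discourages predicting high uncertainty everywhere. This is one common way to keep ambiguous descriptions from dominating training.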

Figure 2: Visualization Results for SpatiaLoc.


This digest was generated by AI and reviewed by a human editor.