QUICK REVIEW

[Paper Review] Ground-Truth Depth in Vision Language Models: Spatial Context Understanding in Conversational AI for XR-Robotic Support in Emergency First Response

Rodrigo Gutiérrez, Marita Hueber|arXiv (Cornell University)|Feb 16, 2026

Multimodal Machine Learning Applications | Cited by: 0

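One-Sentence Summary

The paper evaluates depth-augmented vision language model (VLM) feedback in a mixed-reality emergency scenario and finds that depth input outperforms both video-only and depth-agnostic VLM assistance in spatial distance accuracy and situational awareness.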

ABSTRACT

Large language models (LLMs) are increasingly used in emergency first response (EFR) applications to support situational awareness (SA) and decision-making, yet most operate on text or 2D imagery and offer little support for core EFR SA competencies like spatial reasoning. We address this gap by evaluating a prototype that fuses robot-mounted depth sensing and YOLO detection with a vision language model (VLM) capable of verbalizing metrically-grounded distances of detected objects (e.g., the chair is 3.02 meters away). In a mixed-reality toxic-smoke scenario, participants estimated distances to a victim and an exit window under three conditions: video-only, depth-agnostic VLM, and depth-augmented VLM. Depth-augmentation improved objective accuracy and stability, e.g., the victim and window distance estimation error dropped, while raising situational awareness without increasing workload. Conversely, depth-agnostic assistance increased workload and slightly worsened accuracy. We contribute to human SA augmentation by demonstrating that metrically grounded, object-centric verbal information supports spatial reasoning in EFR and improves decision-relevant judgments under time pressure.

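Motivation and Objectives

Improve the situational awareness of emergency first responders through spatially anchored AI feedback.
Address the gap that LLMs in EFR settings are insensitive to depth and rely on 2D cues.
Investigate whether feeding ground-truth depth from robot-mounted sensors into a VLM improves spatial reasoning.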

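Proposed Method

Fuse robot-mounted depth sensing and YOLO-based object detection with a vision language model (VLM) to generate metrically grounded distance descriptions.
Provide the depth and detection outputs to the VLM as structured input so that it can verbalize distances with centimeter-level precision (e.g., "the victim is approximately 0.8 meters in front of the robot").
Compare three conditions in a mixed-reality emergency scenario: a video-only baseline, depth-agnostic VLM support, and depth-augmented VLM support.
Run a controlled experiment with 16 participants in an MR office fire/smoke scenario, using a depth camera, YOLO detections, and a VLM (qwen2.5vl:32b).
Assess outcomes with standardized scales: situational awareness (SART), workload (NASA-TLX), voice interaction quality (SASSI), usability (UMUX-Lite), and confidence in the distance estimates.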
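The fusion step described above (depth plus detections turned into structured input for the VLM) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of the median depth inside each bounding box, the toy depth map, and the text format of the structured input are all assumptions made for illustration.

```python
from statistics import median


def box_distance_m(depth_map, box):
    """Median depth in meters inside a box (x1, y1, x2, y2), skipping invalid pixels.

    Assumption: pixels with depth <= 0 mean "no reading" and are ignored.
    """
    x1, y1, x2, y2 = box
    values = [
        depth_map[y][x]
        for y in range(y1, y2)
        for x in range(x1, x2)
        if depth_map[y][x] > 0
    ]
    return median(values) if values else None


def build_structured_input(depth_map, detections):
    """Format each detection as 'label: distance' text that a VLM prompt could include."""
    lines = []
    for det in detections:
        dist = box_distance_m(depth_map, det["box"])
        if dist is None:
            continue  # no valid depth inside this box
        lines.append(f"{det['label']}: {dist:.2f} m")
    return "\n".join(lines)


# Hypothetical 4x6 depth map in meters; 0.0 marks missing depth.
depth_map = [
    [3.0, 3.0, 3.1, 0.0, 4.4, 4.5],
    [3.0, 3.0, 3.1, 0.0, 4.4, 4.5],
    [3.2, 3.0, 3.0, 0.0, 4.5, 4.5],
    [3.2, 3.0, 3.0, 0.0, 4.5, 4.5],
]
# Hypothetical detector output: label plus pixel bounding box.
detections = [
    {"label": "chair", "box": (0, 0, 3, 4)},
    {"label": "window", "box": (4, 0, 6, 4)},
]

structured = build_structured_input(depth_map, detections)
print(structured)
```

Taking the median rather than the mean is one plausible way to reduce the influence of background pixels and sensor dropouts inside a box; the summary does not say which aggregation the prototype actually uses.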

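Experimental Results

Research Questions

RQ1: Does a depth-augmented VLM support first responders' spatial reasoning better than (a) video-only estimation or (b) a depth-agnostic VLM?
RQ2: Does a depth-augmented VLM reduce distance estimation error and improve task-relevant judgments under time pressure?
RQ3: How does depth augmentation affect workload, situational awareness, and perceived interaction quality in MR-EFR tasks?

Key Findings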

Measure	C1 (video-only) Mean (SD)	C2 (depth-agnostic VLM) Mean (SD)	C3 (depth-augmented VLM) Mean (SD)
NASA-TLX (Workload)	2.56 (1.16)	3.30 (1.26)	3.29 (1.74)
SART (Situational Awareness)	3.70 (1.00)	4.19 (0.91)	4.74 (0.88)
SASSI (Voice Interaction)	4.21 (0.86)	4.41 (0.32)	4.48 (1.12)
UMUX (Perceived Usability)	4.86 (1.46)	4.11 (1.54)	5.00 (0.89)
Confidence (Distance Est.)	3.89 (2.20)	4.11 (2.42)	4.57 (1.90)

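The depth-augmented VLM reduced distance estimation error for both the victim (true distance 3.22 m; C1 mean estimate 2.64 m, error 0.58 m; C3 mean estimate 2.97 m, error 0.25 m) and the window (true distance 4.45 m; C1 mean estimate 4.56 m, error 0.11 m; C3 mean estimate 4.37 m, error 0.08 m).
Depth augmentation improved both the objective accuracy and the variability (SD) of the distance estimates, whereas the depth-agnostic VLM slightly reduced accuracy relative to the baseline.
The depth-augmented condition received the highest situational awareness (SART) score (mean 4.74, SD 0.88), with workload reported as comparable to the baseline, indicating that SA improved without added workload.
Voice interaction quality (SASSI) remained high across conditions, indicating stable perceived voice interaction.
Perceived usability (UMUX-Lite) was highest with depth-augmented support, consistent with the task demands.
Participants' confidence in their distance estimates was higher with depth-augmented support (C3) than with the baseline or depth-agnostic support.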
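The error figures quoted in the first finding can be checked with simple arithmetic. The sketch below assumes that, for each condition, the first number is the mean estimated distance and the second is the absolute error against the true distance; the variable and function names are illustrative only.

```python
def abs_error(true_m, estimate_m):
    """Absolute distance error in meters, rounded to 2 decimals."""
    return round(abs(true_m - estimate_m), 2)


# True distances and mean estimates (meters) as reported in the findings.
targets = {
    "victim": {"true": 3.22, "C1": 2.64, "C3": 2.97},
    "window": {"true": 4.45, "C1": 4.56, "C3": 4.37},
}

errors = {
    name: {cond: abs_error(vals["true"], vals[cond]) for cond in ("C1", "C3")}
    for name, vals in targets.items()
}

for name, err in errors.items():
    print(f"{name}: C1 error {err['C1']:.2f} m, C3 error {err['C3']:.2f} m")
```

Under that reading, the computed errors match the reported ones: 0.58 m versus 0.25 m for the victim and 0.11 m versus 0.08 m for the window.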

This review was generated by AI and reviewed by a human editor.