QUICK REVIEW

[Paper Digest] Monocular 3D Object Position Estimation with VLMs for Human-Robot Interaction

Ari Wahl, Dorian Gawlinski | arXiv (Cornell University) | Mar 1, 2026

Multimodal Machine Learning Applications | Cited by 0

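One-sentence summary

This work fine-tunes a vision-language model with QLoRA and a custom regression head to predict 3D object positions from monocular images, reaching a median MAE of 13 mm and a median Euclidean error of 27 mm on the test set; in about 25% of cases the mean per-coordinate error is below 10 mm, a range considered acceptable for robot interaction tasks.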

ABSTRACT

Pre-trained general-purpose Vision-Language Models (VLM) hold the potential to enhance intuitive human-machine interactions due to their rich world knowledge and 2D object detection capabilities. However, VLMs for 3D coordinates detection tasks are rare. In this work, we investigate interactive abilities of VLMs by returning 3D object positions given a monocular RGB image from a wrist-mounted camera, natural language input, and robot states. We collected and curated a heterogeneous dataset of more than 100,000 images and finetuned a VLM using QLoRA with a custom regression head. By implementing conditional routing, our model maintains its ability to process general visual queries while adding specialized 3D position estimation capabilities. Our results demonstrate robust predictive performance with a median MAE of 13 mm on the test set and a five-fold improvement over a simpler baseline without finetuning. In about 25% of the cases, predictions are within a range considered acceptable for the robot to interact with objects.

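Motivation and goals

Estimate 3D object positions from monocular RGB images in robotic settings.
Add specialized 3D coordinate regression while retaining the capabilities of a general-purpose VLM.
Use LoRA-based fine-tuning so that the base model weights stay intact and conditional routing becomes possible.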

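Proposed method

Start from a pre-trained, general-purpose Vision-Language Model.
Fine-tune it with Low-Rank Adaptation (the abstract specifies QLoRA) and a custom regression head for 3D coordinate regression.
Implement conditional routing to separate general VLM queries from the 3D regression task.
Train on a large robot-workspace dataset (more than 100,000 images) collected with a wrist-mounted camera.
Evaluate on a held-out set using MAE and Euclidean distance.
Route each query to either the base path or the specialized path, so the model keeps its open-set capability.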
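
The conditional routing described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `<pos>` marker used to recognize a position query, the `RoutedModel` class, and the stub functions standing in for the general text path and the regression path are all hypothetical, because the digest does not say how queries are classified. The image input is omitted to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple, Union

Position = Tuple[float, float, float]  # (x, y, z) in millimetres


@dataclass
class RoutedModel:
    """Dispatch a query to the general text path or the 3D regression path."""

    is_position_query: Callable[[str], bool]
    text_path: Callable[[str], str]
    regression_path: Callable[[str, Sequence[float]], Position]

    def __call__(
        self, prompt: str, robot_state: Sequence[float]
    ) -> Union[str, Position]:
        # Only position queries reach the specialized head;
        # everything else is answered by the unchanged base path.
        if self.is_position_query(prompt):
            return self.regression_path(prompt, robot_state)
        return self.text_path(prompt)


# Hypothetical stand-ins; a real system would call the VLM here.
def has_position_marker(prompt: str) -> bool:
    return "<pos>" in prompt


def stub_text_path(prompt: str) -> str:
    return f"(general answer to: {prompt})"


def stub_regression_path(prompt: str, robot_state: Sequence[float]) -> Position:
    # Placeholder: echo the first three robot-state values as a fake position.
    x, y, z = (float(v) for v in robot_state[:3])
    return (x, y, z)


model = RoutedModel(has_position_marker, stub_text_path, stub_regression_path)
position = model("<pos> where is the red cup?", [120.0, -35.0, 40.0])
answer = model("Describe the scene.", [120.0, -35.0, 40.0])
```

Keeping the two paths behind one dispatcher is what lets the general path stay untouched: the regression head only ever sees queries that were routed to it.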

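Experimental results

Research questions

RQ1: Can a VLM reliably produce 3D object coordinates from monocular RGB images in a robot workspace?
RQ2: How does fine-tuning with LoRA and a regression head affect 3D coordinate accuracy compared with the baseline?
RQ3: Does conditional routing preserve general VLM functionality while enabling task-specific 3D estimation?
RQ4: What do the errors look like (for example, uncertainty in z height) across object types and viewing conditions?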

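Main findings

On the test set, the best model, built on LLaVA-v1.5, reaches a median MAE of 13 mm.
The median Euclidean error on the test set is 27 mm.
For about 25% of predictions, the mean per-coordinate error is below 10 mm, which may be sufficient for grasping or pushing tasks.
The model generalizes in an open-set setting to unseen objects and to changes in lighting and object shape, with an MAE below 20 mm in 75% of cases.
Compared with a simpler baseline without fine-tuning, the fine-tuned model shows a roughly five-fold improvement.
Errors in the z coordinate (height) are usually larger and more uncertain than those in x/y, reflecting the difficulty of estimating depth from a single camera.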
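
The metrics behind these findings can be made concrete with a small Python sketch. It assumes that MAE is the mean absolute error over the x, y, z coordinates of one prediction, that the Euclidean error is the straight-line distance between predicted and true position, and that the 10 mm criterion is applied to the per-prediction MAE. The function names and the toy data are illustrative and are not taken from the paper.

```python
import math
from statistics import median


def per_sample_mae(pred, target):
    """Mean absolute error over the coordinates of one prediction (mm)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def euclidean_error(pred, target):
    """Straight-line distance between prediction and ground truth (mm)."""
    return math.dist(pred, target)


def summarize(preds, targets, tol_mm=10.0):
    """Median MAE, median Euclidean error, and share of samples with MAE < tol_mm."""
    maes = [per_sample_mae(p, t) for p, t in zip(preds, targets)]
    dists = [euclidean_error(p, t) for p, t in zip(preds, targets)]
    return {
        "median_mae_mm": median(maes),
        "median_euclidean_mm": median(dists),
        "fraction_below_tol": sum(m < tol_mm for m in maes) / len(maes),
    }


# Toy data in millimetres (made up for illustration only).
targets = [(0.0, 0.0, 0.0)] * 3
preds = [(3.0, 4.0, 0.0), (30.0, 0.0, 0.0), (0.0, 0.0, 60.0)]
result = summarize(preds, targets)
```

Under these definitions the Euclidean error of a prediction is never smaller than its MAE, which is consistent with the reported median Euclidean error (27 mm) being larger than the median MAE (13 mm).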

This digest was generated by AI and reviewed by a human editor.