QUICK REVIEW

[Paper Review] Monocular 3D Object Position Estimation with VLMs for Human-Robot Interaction

Ari Wahl, Dorian Gawlinski|arXiv (Cornell University)|Mar 1, 2026

Multimodal Machine Learning Applications | Citations: 0

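Summary in brief

This study fine-tunes a Vision-Language Model with QLoRA to estimate 3D object positions from a single monocular RGB image. It reports a median MAE of 13 mm and a median Euclidean error of 27 mm on the test set, with about 25% of predictions falling below 10 mm of error per coordinate, a range considered acceptable for robot manipulation.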

ABSTRACT

Pre-trained general-purpose Vision-Language Models (VLM) hold the potential to enhance intuitive human-machine interactions due to their rich world knowledge and 2D object detection capabilities. However, VLMs for 3D coordinates detection tasks are rare. In this work, we investigate interactive abilities of VLMs by returning 3D object positions given a monocular RGB image from a wrist-mounted camera, natural language input, and robot states. We collected and curated a heterogeneous dataset of more than 100,000 images and finetuned a VLM using QLoRA with a custom regression head. By implementing conditional routing, our model maintains its ability to process general visual queries while adding specialized 3D position estimation capabilities. Our results demonstrate robust predictive performance with a median MAE of 13 mm on the test set and a five-fold improvement over a simpler baseline without finetuning. In about 25% of the cases, predictions are within a range considered acceptable for the robot to interact with objects.

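Motivation and objectives

Estimate 3D object positions from a monocular RGB image in a robotic setting.
Add specialized 3D coordinate regression while preserving general VLM capabilities.
Use LoRA-based fine-tuning so that the base model stays intact while conditional routing is enabled.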

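Proposed method

Start from a pre-trained general-purpose Vision-Language Model.
Fine-tune with QLoRA, a quantized variant of Low-Rank Adaptation (LoRA), plus a custom 3D coordinate regression head.
Implement conditional routing that separates general VLM queries from the 3D regression task.
Train on a large robot-workspace dataset (more than 100,000 images) collected with a wrist-mounted camera.
Evaluate on a held-out set using MAE and Euclidean distance.
Route flexibly between the base and specialized pathways to retain open-set capability.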
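The fine-tuning and routing steps listed above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the `POSITION_TAG` trigger, the class names, and the idea of routing on a prompt marker are assumptions made for illustration, since the review does not say how the routing condition is detected. The low-rank update itself follows the standard LoRA form, where a frozen weight `W` is augmented by `(alpha / r) * B @ A`.

```python
from dataclasses import dataclass
from typing import List, Union

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


@dataclass
class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""
    W: Matrix     # frozen base weight, shape (out, in)
    A: Matrix     # trainable factor, shape (r, in)
    B: Matrix     # trainable factor, shape (out, r)
    alpha: float  # LoRA scaling hyperparameter

    def __call__(self, x: Vector) -> Vector:
        r = len(self.A)  # rank of the update
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + (self.alpha / r) * d for b, d in zip(base, delta)]


@dataclass
class RegressionHead:
    """Linear map from a hidden state to an (x, y, z) position."""
    weight: Matrix  # shape (3, hidden)
    bias: Vector    # shape (3,)

    def __call__(self, h: Vector) -> Vector:
        return [o + b for o, b in zip(matvec(self.weight, h), self.bias)]


# Hypothetical trigger marker; the actual routing signal is an assumption.
POSITION_TAG = "<3d_position>"


def answer(prompt: str, hidden: Vector,
           adapter: LoRALinear, head: RegressionHead) -> Union[str, Vector]:
    """Send position queries through the adapted path; keep others on the base path."""
    if POSITION_TAG in prompt:
        return head(adapter(hidden))
    # Placeholder standing in for the unmodified VLM's text generation.
    return "base-model text answer"
```

Because the adapter and head are only invoked when the marker is present, every other query sees the untouched base model, which is the property the routing step is meant to preserve.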

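Experimental results

Research questions

RQ1: Can a VLM produce reliable 3D object coordinates in a robot workspace from a monocular RGB image?
RQ2: How does fine-tuning with LoRA and a regression head change 3D coordinate accuracy compared with a baseline?
RQ3: Does conditional routing preserve general VLM capabilities while enabling task-specific 3D estimation?
RQ4: How do error characteristics (e.g., uncertainty in z height) vary with object type and viewing conditions?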

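Key findings

The best model, based on LLaVA-v1.5, reaches a median MAE of 13 mm on the test set.
The median Euclidean error on the test set is 27 mm.
About 25% of predictions have a mean per-coordinate error below 10 mm, which may be adequate for grasping and pushing tasks.
Open-set generalization to visually novel objects, varied lighting, and varied object shapes is reported, with MAE below 20 mm in 75% of cases.
The fine-tuned model improves roughly five-fold over a simpler baseline without fine-tuning.
Z-coordinate (height) error is generally larger and more uncertain than x/y error, reflecting the difficulty of monocular depth estimation.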
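To make the reported metrics concrete, the following sketch computes a median per-sample MAE, a median Euclidean error, and the share of predictions under a tolerance. It assumes that "median MAE" means the median over samples of the error averaged across x, y, and z, and that the 10 mm criterion applies to that per-sample average; the review does not spell out either definition, and the function names and example numbers are illustrative only.

```python
import math
from statistics import median
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in millimetres


def per_sample_mae(pred: Point, true: Point) -> float:
    """Absolute error averaged over the three coordinates of one sample."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / 3


def euclidean_error(pred: Point, true: Point) -> float:
    """Straight-line distance between predicted and true positions."""
    return math.dist(pred, true)


def summarize(preds: Sequence[Point], trues: Sequence[Point],
              tol_mm: float = 10.0) -> Dict[str, float]:
    """Aggregate error statistics over a set of predictions."""
    maes = [per_sample_mae(p, t) for p, t in zip(preds, trues)]
    dists = [euclidean_error(p, t) for p, t in zip(preds, trues)]
    return {
        "median_mae_mm": median(maes),
        "median_euclidean_mm": median(dists),
        # Share of samples whose averaged per-coordinate error is under tol_mm.
        "fraction_within_tol": sum(m < tol_mm for m in maes) / len(maes),
    }


# Made-up example values, not data from the paper.
example_preds = [(3.0, 4.0, 0.0), (0.0, 0.0, 30.0), (6.0, 8.0, 0.0)]
example_trues = [(0.0, 0.0, 0.0)] * 3
stats = summarize(example_preds, example_trues)
```

The Euclidean error is never smaller than the per-sample MAE for the same prediction, which is consistent with the review reporting a larger median Euclidean error (27 mm) than median MAE (13 mm).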

This review was generated by AI and checked by a human editor.