QUICK REVIEW

[Paper Review] SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Boyuan Chen, Zhuo Xu|arXiv (Cornell University)|Jan 22, 2024

Multimodal Machine Learning Applications | Citations: 5

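One-sentence summary

SpatialVLM trains a vision-language model on a large-scale, synthetically generated 3D spatial reasoning dataset, enabling qualitative and quantitative spatial reasoning (including metric distance estimation from 2D images) as well as chain-of-thought spatial reasoning in combination with an LLM.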

ABSTRACT

Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics. While Vision Language Models (VLM) have demonstrated remarkable performance in certain VQA benchmarks, they still lack capabilities in 3D spatial reasoning, such as recognizing quantitative relationships of physical objects like distances or size differences. We hypothesize that VLMs' limited spatial reasoning capability is due to the lack of 3D spatial knowledge in training data and aim to solve this problem by training VLMs with Internet-scale spatial reasoning data. To this end, we present a system to facilitate this approach. We first develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images. We then investigate various factors in the training recipe, including data quality, training pipeline, and VLM architecture. Our work features the first internet-scale 3D spatial reasoning dataset in metric space. By training a VLM on such data, we significantly enhance its ability on both qualitative and quantitative spatial VQA. Finally, we demonstrate that this VLM unlocks novel downstream applications in chain-of-thought spatial reasoning and robotics due to its quantitative estimation capability. Project website: https://spatial-vlm.github.io/

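Motivation and objectives

Address the lack of 3D spatial reasoning in current VLMs.
Develop an automatic, scalable data generation pipeline that produces 3D spatial reasoning VQA data from real-world images.
Train a VLM on the synthesized spatial QA data to improve qualitative and quantitative spatial reasoning.
Demonstrate the benefits on downstream tasks: robotics, reward annotation, and chain-of-thought spatial reasoning in combination with an LLM.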

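Proposed method

Extract object-centric context from real images using open-vocabulary detection, metric depth estimation, semantic segmentation, and object-centric captioning.
Lift the 2D context to 3D by estimating depth, building a 3D point cloud, and transforming it into a canonicalized coordinate frame.
Generate 2B spatial QA pairs from 10M images using template-based questions (qualitative and quantitative).
Train a vision-language model (PaLM-E family) on a mixture of the PaLM-E family's data and the SpatialVLM spatial data, with spatial data making up 5% of the tokens.
Combine SpatialVLM's outputs with a large language model (e.g., GPT-4) to enable chain-of-thought reasoning on multi-step spatial tasks.
Investigate how data quality, the training pipeline, and freezing the ViT affect spatial reasoning ability.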
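
The 2D-to-3D lifting and template-based QA steps listed above can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a pinhole camera with made-up intrinsics, takes per-pixel metric depth and object masks as already given, represents each object by the centroid of its lifted points, and skips the canonicalization step entirely. All names (`Intrinsics`, `unproject`, `lift_object`, `make_qa`) and both question templates are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Intrinsics:
    """Pinhole camera intrinsics in pixels (hypothetical values below)."""
    fx: float
    fy: float
    cx: float
    cy: float


def unproject(u, v, depth_m, k):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


def lift_object(pixels_with_depth, k):
    """Lift each masked pixel (u, v, depth) and return the 3D centroid."""
    points = [unproject(u, v, d, k) for u, v, d in pixels_with_depth]
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def make_qa(name_a, pos_a, name_b, pos_b):
    """Fill one quantitative and one qualitative template for an object pair."""
    dist = math.dist(pos_a, pos_b)
    quantitative = (
        f"What is the distance between the {name_a} and the {name_b}?",
        f"About {dist:.2f} meters.",
    )
    # Assumed convention: +x points to the right in the camera frame.
    qualitative = (
        f"Is the {name_a} to the left of the {name_b}?",
        "Yes." if pos_a[0] < pos_b[0] else "No.",
    )
    return [quantitative, qualitative]


# Toy input: two objects, two masked pixels each, all at 1 m depth.
k = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
cup = lift_object([(218, 240, 1.0), (222, 240, 1.0)], k)
laptop = lift_object([(418, 240, 1.0), (422, 240, 1.0)], k)
qa = make_qa("cup", cup, "laptop", laptop)
for question, answer in qa:
    print(question, "->", answer)
```

Using a centroid is the simplest possible object summary; a fuller pipeline would need outlier filtering on the point cloud and a gravity-aligned frame before relations such as "above" or "taller" become meaningful.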

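Experimental results

Research questions

RQ1: Does synthetic 3D spatial reasoning data improve a VLM's ability to answer qualitative and quantitative spatial questions?
RQ2: How do data quality, training strategy, and model freezing affect spatial reasoning performance?
RQ3: Can SpatialVLM provide distance and size estimates reliable enough to support downstream robotics tasks and chain-of-thought reasoning?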

Key findings

Model	Accuracy
GPT-4V	68.0%
LLaVA-1.5	71.3%
InstructBLIP	60.4%
PaLI	60.7%
PaLM-E	50.2%
PaLM 2-E	50.4%
Ours	75.2%

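SpatialVLM achieves higher qualitative spatial reasoning accuracy on the binary predicate task than GPT-4V, LLaVA-1.5, InstructBLIP, PaLI, PaLM-E, and PaLM 2-E (75.2% vs. 50.2–71.3% for the baselines).
SpatialVLM also achieves higher accuracy on quantitative spatial questions: its outputs often fall within the accepted range for distance estimates, and it outperforms the baselines on distance-related metrics.
Co-training with SpatialVLM data yields competitive general VQA performance on OKVQA and VQA v2 (e.g., a 2.4% improvement on VQA v2 over PaLM 2-E trained without SpatialVLM data).
Unfreezing the ViT improves fine-grained distance estimation accuracy; a frozen ViT is less accurate across all distance ranges.
A VLM trained on noisy spatial data still learns general spatial reasoning and is robust to the noise level in the quantitative answers.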
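
The quantitative finding above, that outputs often fall within an accepted range of the true distance, can be made concrete with a small scoring sketch. The review does not state the acceptance band, so the 0.5x to 2x band used here is an assumption, as are the helper names (`parse_distance_m`, `in_range_fraction`), the unit table, and the toy answers.

```python
import re

# Assumed unit table; a real evaluation would need a fuller one.
UNIT_TO_M = {
    "m": 1.0, "meter": 1.0, "meters": 1.0,
    "cm": 0.01, "centimeter": 0.01, "centimeters": 0.01,
}


def parse_distance_m(answer):
    """Return the first 'number unit' distance in the answer, in meters."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]+)", answer)
    if match is None:
        return None
    scale = UNIT_TO_M.get(match.group(2).lower())
    return None if scale is None else float(match.group(1)) * scale


def in_range_fraction(answers, truths_m, low=0.5, high=2.0):
    """Fraction of answers whose distance lies in [low, high] x ground truth.

    Answers with no parseable distance count as misses.
    """
    if not truths_m:
        raise ValueError("need at least one ground-truth distance")
    hits = 0
    for answer, truth in zip(answers, truths_m):
        pred = parse_distance_m(answer)
        if pred is not None and low * truth <= pred <= high * truth:
            hits += 1
    return hits / len(truths_m)


# Toy data, not results from the paper.
answers = ["About 1.2 meters.", "Roughly 30 cm", "It is far away."]
truths_m = [1.0, 1.0, 2.0]
score = in_range_fraction(answers, truths_m)
print(f"in-range fraction: {score:.2f}")
```

Counting unparseable answers as misses keeps the score comparable across models that do and do not reliably output a number.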


This review was generated by AI and checked by a human editor.