QUICK REVIEW

[Paper Review] Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation

Yuchen Li, Amanmeet Garg | arXiv (Cornell University) | Mar 19, 2026

Multimodal Machine Learning Applications | Citations: 0

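One-line summary

Perceptio injects explicit 2D semantic segmentation tokens and discretized 3D depth tokens into an autoregressive LVLM, enabling in-sequence spatial perception and improving grounding on RES, spatial reasoning, and VQA benchmarks.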

ABSTRACT

Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.

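Motivation and objectives

Motivate the need for explicit spatial grounding in LVLMs that goes beyond semantic understanding.
Propose a method for injecting 2D segmentation tokens and 3D depth tokens into the autoregressive generation process.
Develop new depth-token losses and a soft reconstruction that stabilize depth token generation during end-to-end training.
Build a joint perception-annotated dataset for training the model across segmentation, depth, and language tasks.
Demonstrate state-of-the-art performance on referring expression segmentation and related spatial reasoning benchmarks.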

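Proposed method

Propose Perceptio, an LVLM that emits segmentation tokens (seg tokens) and a sequence of depth tokens representing discretized depth before generating text.
Discretize depth into K codes using a VQ-VAE depth codebook trained on Depth Anything V2 predictions, forming the depth tokens.
Introduce SAM2-based segmentation tokens, conditioned on the query text, that guide the segmentation decoder.
Train with a multi-task objective that combines the LLM loss, a segmentation reconstruction loss (CE + Dice), a depth token generation loss, and a differentiable depth reconstruction loss obtained through soft codebook merging.
Impose a fixed output order during generation: segmentation tokens, then depth tokens, then the final answer. This induces a spatial chain-of-thought grounding process.
Build combined data by augmenting referring expression segmentation data (RefCOCO/+/g) with aligned depth tokens and object descriptions, and fine-tune on a mixture of image QA, grounding, and depth-guided data.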
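
The depth tokenization and soft codebook merging steps listed above can be illustrated with a small sketch. This review does not give the paper's exact formulation, so the code assumes the common VQ-VAE recipe: hard tokens come from a nearest-neighbor lookup in the codebook, and the differentiable "soft" reconstruction is a softmax-weighted average of codebook entries. The function names (`quantize`, `soft_merge`), the toy codebook, and the feature values are all hypothetical.

```python
import math

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest code (squared L2)."""
    tokens = []
    for f in features:
        dists = [sum((a - b) ** 2 for a, b in zip(f, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_merge(logits, codebook):
    """Assumed form of soft merging: probability-weighted mean of the codes.

    Unlike an argmax lookup, this is a smooth function of the logits, so a
    reconstruction loss could propagate gradients back to the token logits.
    """
    probs = softmax(logits)
    dim = len(codebook[0])
    return [sum(p * code[d] for p, code in zip(probs, codebook)) for d in range(dim)]

# Toy codebook with K = 3 codes of dimension 2 (made-up values).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
features = [[0.1, -0.1], [3.5, 4.2], [0.9, 1.2]]

tokens = quantize(features, codebook)
print(tokens)  # [0, 2, 1]

# Sharply peaked logits recover (almost exactly) the selected code.
print(soft_merge([0.0, 50.0, 0.0], codebook))
```

With uniform logits the merged vector is simply the mean of all codes, which shows why the soft version behaves as a relaxation of the hard lookup rather than a replacement for it.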

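Experimental results

Research questions

RQ1: How can 2D semantic segmentation and 3D depth reasoning be explicitly integrated into a single autoregressive LVLM without external pipelines?
RQ2: Which loss functions and training strategies stabilize discrete depth token generation and enable differentiable depth reconstruction?
RQ3: Does in-sequence generation of perception tokens improve fine-grained spatial grounding and VQA performance across diverse benchmarks?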

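Key findings

Perceptio achieves state-of-the-art referring segmentation on RefCOCO, RefCOCO+, and RefCOCOg, with higher cIoU than Sa2VA-8B (82.7, 77.9, and 80.0, respectively).
HardBLINK spatial reasoning accuracy improves by 10.3 points on average with Perceptio-8B (75.8/71.0/66.1 for 3/4/5 points, 71.0 on average).
Perceptio-8B reaches 83.4 on MMBench and 75.7 on SEED-Bench, with MME perception/cognition scores of 1654/628.
Perceptio-4B outperforms existing larger baselines on several metrics, indicating a strong benefit from the proposed perception tokens.
Ablations show that depth tokens are essential for 3D spatial reasoning, while segmentation tokens complement VQA-style reasoning: removing depth sharply degrades HardBLINK, and removing segmentation lowers MME, MMBench, and SEED performance.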
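
For readers unfamiliar with the cIoU metric used in the findings above, the following sketch shows how it is usually computed in the referring segmentation literature: intersections and unions are accumulated over the whole dataset before dividing, rather than averaging per-image IoU. The review does not restate the definition, so this is the conventional one and not necessarily the authors' evaluation code; the masks are made-up toy data. The last lines also check that the three reported HardBLINK scores average to the reported 71.0.

```python
def ciou(pred_masks, gt_masks):
    """Cumulative IoU: total intersection over total union across all samples.

    Each mask is a flat list of 0/1 pixel labels.
    """
    inter = 0
    union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        for p, g in zip(pred, gt):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 0.0

# Two toy 4-pixel samples (hypothetical data).
preds = [[1, 1, 0, 0], [1, 1, 1, 1]]
gts = [[1, 0, 0, 0], [1, 1, 1, 0]]

score = ciou(preds, gts)
print(round(score, 4))  # 0.6667, i.e. (1 + 3) / (2 + 4)

# Consistency check on the reported HardBLINK numbers (3/4/5 points).
hardblink_avg = round((75.8 + 71.0 + 66.1) / 3, 1)
print(hardblink_avg)  # 71.0
```

Because cIoU pools pixels across samples, large objects weigh more than small ones; in the toy data the per-image mean IoU would be 0.625, not 0.6667.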

This review was generated by AI and checked by a human editor.