QUICK REVIEW

[Paper Review] Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation

Yuchen Li, Amanmeet Garg|arXiv (Cornell University)|Mar 19, 2026

Multimodal Machine Learning Applications|Cited by 0

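One-Sentence Summary

Perceptio injects explicit 2D segmentation tokens and discretized 3D depth tokens into an autoregressive LVLM, enabling in-sequence spatial perception and improving grounding on RES, spatial reasoning, and VQA benchmarks.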

ABSTRACT

Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.

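Motivation and Objectives

Motivate the need for explicit spatial grounding in LVLMs, beyond semantic understanding.
Propose a method that injects 2D segmentation tokens and 3D depth tokens into the LVLM's autoregressive generation.
Develop new depth-token losses and a soft reconstruction step for end-to-end training, to stabilize depth-token output.
Build a joint perception-annotated dataset for training the model on segmentation, depth, and language tasks.
Demonstrate state-of-the-art performance on referring expression segmentation and related spatial reasoning benchmarks.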

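Proposed Method

Introduce Perceptio, an LVLM that, before generating text, emits [seg] tokens for segmentation and a sequence of [depth] tokens representing discretized depth.
Use a VQ-VAE depth codebook trained on Depth Anything V2 predictions to discretize depth into K codes, which form the depth tokens.
Incorporate SAM2-based segmentation tokens, conditioned on the query text, to guide segmentation decoding.
Train with a multi-task objective that combines the LLM loss, a segmentation reconstruction loss (CE + Dice), a depth-token generation loss, and a differentiable depth reconstruction loss enabled by soft codebook merging.
Enforce a fixed output order during generation: segmentation tokens first, then depth tokens, then the answer, to induce a spatially coherent grounding process.
Curate a joint dataset that augments referring expression segmentation data (RefCOCO/+/g) with aligned depth tokens and object descriptions, and fine-tune on a mixture of image question answering, grounding, and depth-guided data.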
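
The depth tokenization and soft codebook merging steps listed above can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not specify the exact formulation, so the code assumes the common VQ-VAE convention (nearest codebook entry by squared L2 distance) and reads "soft merging" as a probability-weighted average of codebook entries. The function names, the toy codebook with K = 3, and the feature values are all hypothetical.

```python
import math

def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in codebook]
    return min(range(len(codebook)), key=dists.__getitem__)

def tokenize_depth(features, codebook):
    """Hard quantization: map each depth feature vector to a discrete token index."""
    return [nearest_code(f, codebook) for f in features]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_merge(logits, codebook):
    """Assumed soft merge: expected codebook entry under the token distribution.

    In an autodiff framework this weighted sum stays differentiable with
    respect to the logits, unlike a hard argmax lookup.
    """
    probs = softmax(logits)
    dim = len(codebook[0])
    return [sum(p * code[d] for p, code in zip(probs, codebook)) for d in range(dim)]

# Toy codebook with K = 3 two-dimensional entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
features = [[0.1, -0.1], [3.5, 0.2], [0.9, 1.2]]

tokens = tokenize_depth(features, codebook)      # discrete depth tokens
merged = soft_merge([0.0, 0.0, 0.0], codebook)   # uniform logits average all codes
print(tokens, merged)
```

With sharply peaked logits the soft merge approaches the hard lookup, which is why a soft variant can stand in for the non-differentiable argmax during training while the model still emits discrete tokens at inference time.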

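Experimental Results

Research Questions

RQ1: How can 2D semantic segmentation and 3D depth reasoning be explicitly integrated into a single autoregressive LVLM without external pipelines?
RQ2: Which loss functions and training strategies stabilize discrete depth-token generation and enable differentiable depth reconstruction?
RQ3: Does generating perception tokens in-sequence improve fine-grained spatial grounding and VQA performance across multiple benchmarks?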

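Key Findings

Perceptio achieves state-of-the-art referring expression segmentation on RefCOCO, RefCOCO+, and RefCOCOg (cIoU of 82.7, 77.9, and 80.0, respectively, outperforming Sa2VA-8B).
Perceptio-8B improves HardBLINK spatial reasoning accuracy by 10.3 percentage points on average (75.8/71.0/66.1 on the 3/4/5-point settings, mean 71.0).
Perceptio-8B reaches 83.4 accuracy on MMBench and 75.7 on SEED-Bench; its MME perception/cognition scores are 1654/628.
Perceptio-4B already surpasses larger baselines on several metrics, indicating substantial gains from the proposed perception tokens.
Ablations indicate that depth tokens are essential for 3D spatial reasoning, while segmentation tokens complement VQA-style reasoning: removing depth markedly lowers HardBLINK accuracy, and removing segmentation lowers MME/MMBench/SEED performance.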
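
The cIoU figures above are easier to interpret with a small worked example. The sketch below assumes cIoU means cumulative IoU as it is commonly used in referring expression segmentation: total intersection pixels divided by total union pixels over the whole evaluation set, rather than the mean of per-image IoU values. The summary itself does not define the metric, and the function name `cumulative_iou` and the toy masks are illustrative only.

```python
def cumulative_iou(pred_masks, gt_masks):
    """Cumulative IoU: summed intersection over summed union across all samples.

    Each mask is a list of rows of 0/1 values; predictions and ground truths
    are paired by position.
    """
    inter = 0
    union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                inter += 1 if (p and g) else 0
                union += 1 if (p or g) else 0
    return inter / union if union else 0.0

# Two toy 2x2 samples: the first over-segments by one pixel, the second is exact.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gts = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]

print(cumulative_iou(preds, gts))  # 2 intersecting pixels / 3 union pixels
```

Here the per-image IoUs are 0.5 and 1.0 (mean 0.75), while the cumulative value is 2/3, so under this definition samples with larger masks weigh more heavily in the reported score.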

This review was generated by AI and reviewed by a human editor.