QUICK REVIEW

[Paper Review] SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning

Byungwoo Jeon, Dongyoung Kim | arXiv (Cornell University) | Mar 23, 2026
Multimodal Machine Learning Applications | Cited by 0
One-Sentence Summary

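SpatialBoost converts dense 3D cues into language-guided, multi-turn reasoning to improve the 3D spatial understanding of pre-trained vision encoders, with consistent gains on both 3D perception and general vision tasks.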

ABSTRACT

Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea involves converting dense 3D spatial information from 2D images into linguistic expressions, which is then used to inject such spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate effectiveness, we adapt SpatialBoost to state-of-the-art vision encoders such as DINOv3, and evaluate its performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 performance from 55.9 to 59.7 mIoU on ADE20K, achieving state-of-the-art performance with 3.8% gain over the pre-trained DINOv3.

Motivation and Objectives

  • Motivate and address the gap in 3D spatial awareness of 2D-trained vision encoders.
  • Inject dense 3D spatial knowledge via language descriptions using an LLM.
  • Preserve pre-trained knowledge while enabling spatial reasoning through a dual-channel attention mechanism.

Proposed Method

  • Extract dense 3D spatial information from images (depth, 3D reconstruction, segmentation, region captions).
  • Convert spatial information into multi-turn, pixel- to scene-level reasoning with an LLM.
  • Align visual encoder features with the LLM embedding space via staged training (feature alignment, visual instruction tuning).
  • Fine-tune the vision encoder with a dual-channel attention module to inject spatial reasoning while preserving pre-trained knowledge.
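
The second step above, turning dense spatial cues into pixel-, object-, and scene-level question-answer turns, can be illustrated with a toy sketch. This is not the authors' pipeline: the function names, the question templates, the use of metric depth in meters, and the representation of a segmentation mask as a list of (row, col) pixels are all assumptions made for illustration, since this review does not give the actual prompts or data format.

```python
from statistics import mean

# Hypothetical sketch: all names and templates below are illustrative assumptions.

def pixel_turn(depth, a, b):
    """Pixel level: compare the depth of two image points given as (row, col)."""
    da, db = depth[a[0]][a[1]], depth[b[0]][b[1]]
    closer = "A" if da < db else "B"  # ties fall to B in this toy version
    return {
        "level": "pixel",
        "question": f"Point A is at {a} and point B is at {b}. Which is closer to the camera?",
        "answer": f"A is {da:.1f} m away and B is {db:.1f} m away, so point {closer} is closer.",
    }

def object_depths(depth, masks):
    """Average the depth over each object's mask (a list of (row, col) pixels)."""
    return {name: mean(depth[r][c] for r, c in pixels) for name, pixels in masks.items()}

def object_turn(obj_depth, first, second):
    """Object level: compare two objects using their mean depth."""
    front = first if obj_depth[first] < obj_depth[second] else second
    return {
        "level": "object",
        "question": f"Is the {first} or the {second} closer to the camera?",
        "answer": (f"The {first} is about {obj_depth[first]:.1f} m away and the {second} "
                   f"about {obj_depth[second]:.1f} m, so the {front} is in front."),
    }

def scene_turn(obj_depth):
    """Scene level: order every object from nearest to farthest."""
    order = sorted(obj_depth, key=obj_depth.get)
    return {
        "level": "scene",
        "question": "Describe the depth ordering of the objects in the scene.",
        "answer": f"From nearest to farthest: {', '.join(order)}.",
    }

def build_dialogue(depth, masks, a, b):
    """Chain the turns so later questions build on earlier, finer-grained ones."""
    obj_depth = object_depths(depth, masks)
    first, second = list(masks)[:2]
    return [pixel_turn(depth, a, b), object_turn(obj_depth, first, second), scene_turn(obj_depth)]

# Toy 2x3 depth map (meters) with two segmented objects.
depth = [[1.0, 1.0, 4.0],
         [1.0, 2.0, 4.0]]
masks = {"chair": [(0, 0), (1, 0)], "window": [(0, 2), (1, 2)]}
dialogue = build_dialogue(depth, masks, a=(0, 0), b=(0, 2))
for turn in dialogue:
    print(f"[{turn['level']}] Q: {turn['question']}\n         A: {turn['answer']}")
```

In a real setting the depth map and masks would come from depth-estimation and segmentation models rather than hand-written lists; the sketch only shows how the same dense cues can feed progressively coarser questions.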
Figure 1: Overview of SpatialBoost. We enhance spatial and geometric understanding of pre-trained vision encoders by leveraging language-guided spatial reasoning. SpatialBoost consists of (a) spatial knowledge extraction through depth estimation, 3D reconstruction, segmentation, and region captioni

Experimental Results

Research Questions

  • RQ1: Does SpatialBoost improve the spatial understanding of pre-trained vision encoders across 2D and 3D tasks?
  • RQ2: Can language-guided, multi-turn spatial reasoning provide transferable gains without catastrophic forgetting?
  • RQ3: Which components (LLM decoding, multi-turn reasoning, dual-channel attention) contribute most to performance gains?

Key Findings

  • SpatialBoost improves 3D-centric tasks across encoders (e.g., DINOv3, SigLIPv2) and benchmarks.
  • On ADE20K, DINOv3 with SpatialBoost reaches 59.7% mIoU (up from 55.9%), a 3.8 percentage point gain.
  • ImageNet linear probing for DINOv3 rises from 88.4% to 90.2% with SpatialBoost.
  • In 3D scene understanding (Lexicon3D), BLEU-1 on SQA3D improves from 51.4 to 54.9 when SpatialBoost is applied to OpenCLIP.
  • Depth estimation on NYUd with SigLIPv2 improves from RMSE 0.51 to 0.39 (linear probing).
  • The gains are not limited to 3D-centric tasks: general vision tasks such as image retrieval and classification also improve.
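
The figures above mix higher-is-better scores (mIoU, accuracy, BLEU-1) with a lower-is-better error (RMSE), and an absolute gain in points is easy to confuse with a relative gain in percent. The short sketch below recomputes both for the numbers reported in this review. The helper name `gain` and the dictionary labels are illustrative choices, not anything defined by the paper.

```python
def gain(before, after, lower_is_better=False):
    """Return (absolute improvement, relative improvement in %) for one metric."""
    delta = after - before
    improvement = -delta if lower_is_better else delta
    relative = 100.0 * improvement / before
    return round(improvement, 2), round(relative, 1)

# (before, after, lower_is_better) as reported in the findings above.
reported = {
    "ADE20K mIoU (DINOv3)": (55.9, 59.7, False),
    "ImageNet linear probing (DINOv3)": (88.4, 90.2, False),
    "SQA3D BLEU-1 (OpenCLIP)": (51.4, 54.9, False),
    "NYUd depth RMSE (SigLIPv2)": (0.51, 0.39, True),
}

for name, (before, after, lower) in reported.items():
    absolute, relative = gain(before, after, lower)
    print(f"{name}: {absolute} absolute, {relative}% relative")
```

For ADE20K this gives an absolute gain of 3.8 points, which is about 6.8% relative to the 55.9 baseline, so the "3.8% gain" wording in the abstract is best read as percentage points. The RMSE drop of 0.12 is the largest relative change of the four, at roughly 23.5%.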
Figure 2: Illustration of the multi-turn visual spatial reasoning dataset, exhibiting pixel-level, object-level, and scene-level reasoning QAs. At the pixel-level, the QA task queries the 3D positions of points (e.g., via depth estimation). At the object-level, it extracts spatial properties of obj

This review was generated by AI and reviewed by human editors.