[Paper Review] SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning
SpatialBoost augments pre-trained vision encoders with 3D spatial knowledge by converting dense 3D cues into language-guided, multi-turn reasoning using an LLM, achieving consistent gains across 3D-aware and general vision tasks.
Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea is to convert dense 3D spatial information from 2D images into linguistic expressions, which are then used to inject that spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate its effectiveness, we apply SpatialBoost to state-of-the-art vision encoders such as DINOv3 and evaluate the performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 performance from 55.9 to 59.7 mIoU on ADE20K, achieving state-of-the-art performance with a gain of 3.8 points over the pre-trained DINOv3.
Motivation & Objective
- Identify and address the gap in 3D spatial awareness of vision encoders trained on 2D images.
- Inject dense 3D spatial knowledge via language descriptions using an LLM.
- Preserve pre-trained knowledge while enabling spatial reasoning through a dual-channel attention mechanism.
Proposed method
- Extract dense 3D spatial information from images (depth, 3D reconstruction, segmentation, region captions).
- Convert spatial information into multi-turn, pixel- to scene-level reasoning with an LLM.
- Align visual encoder features with the LLM embedding space via staged training (feature alignment, visual instruction tuning).
- Fine-tune the vision encoder with a dual-channel attention module to inject spatial reasoning while preserving pre-trained knowledge.
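As a rough illustration of the second step, the sketch below turns a depth map and a set of object masks into a pixel-level, object-level, and scene-level sequence of question/answer turns. It is a minimal sketch under stated assumptions, not the paper's pipeline: the function name `build_spatial_dialogue`, the input layout, and the question templates are all hypothetical, and the review does not say how the authors actually phrase or generate their turns.

```python
from statistics import mean


def build_spatial_dialogue(depth, objects):
    """Build multi-turn QA from coarse 3D cues (hypothetical helper).

    depth:   2D list of per-pixel depths in metres (assumed input format)
    objects: dict mapping an object name to its list of (row, col) pixels
    """
    turns = []

    # Pixel level: ask for the depth at one representative pixel per object.
    for name, pixels in objects.items():
        r, c = pixels[0]
        turns.append({
            "level": "pixel",
            "question": f"How far away is the point ({r}, {c}) on the {name}?",
            "answer": f"About {depth[r][c]:.1f} m.",
        })

    # Object level: compare the mean depth of every pair of objects.
    mean_depth = {
        name: mean(depth[r][c] for r, c in pixels)
        for name, pixels in objects.items()
    }
    names = sorted(mean_depth)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            closer = a if mean_depth[a] < mean_depth[b] else b
            turns.append({
                "level": "object",
                "question": f"Which is closer to the camera, the {a} or the {b}?",
                "answer": f"The {closer}.",
            })

    # Scene level: order all objects from nearest to farthest.
    ordered = sorted(names, key=mean_depth.get)
    turns.append({
        "level": "scene",
        "question": "List the objects from nearest to farthest.",
        "answer": ", ".join(ordered) + ".",
    })
    return turns


if __name__ == "__main__":
    toy_depth = [[1.0, 1.0, 4.0],
                 [1.0, 2.0, 4.0]]
    toy_objects = {"chair": [(0, 0), (1, 0)], "table": [(0, 2), (1, 2)]}
    for turn in build_spatial_dialogue(toy_depth, toy_objects):
        print(turn["level"], "|", turn["question"], "->", turn["answer"])
```

Ordering the turns from pixel to scene level mirrors the hierarchical reasoning described above; in a real system the depth and masks would come from the dense 3D extraction step rather than hand-written toy data.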

Experimental results
Research questions
- RQ1: Does SpatialBoost improve the spatial understanding of pre-trained vision encoders across 2D and 3D tasks?
- RQ2: Can language-guided, multi-turn spatial reasoning provide transferable gains without catastrophic forgetting?
- RQ3: Which components (LLM decoding, multi-turn reasoning, dual-channel attention) contribute most to performance gains?
Key findings
- SpatialBoost improves 3D-centric tasks across encoders (e.g., DINOv3, SigLIPv2) and benchmarks.
- On ADE20K, DINOv3 with SpatialBoost reaches 59.7% mIoU (up from 55.9%), a 3.8 percentage point gain.
- ImageNet linear probing for DINOv3 rises from 88.4% to 90.2% with SpatialBoost.
- In 3D scene understanding (Lexicon3D), OpenCLIP's BLEU-1 on SQA3D improves from 51.4 to 54.9 with SpatialBoost.
- For depth estimation on NYUd (linear probing), SigLIPv2's RMSE drops from 0.51 to 0.39 (lower is better).
- The gains are not limited to 3D-centric tasks: SpatialBoost also improves general vision tasks such as image retrieval and classification.
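The figures above mix percentage-point changes with raw metric differences, so it can help to separate absolute from relative change. The short sketch below does that arithmetic for the numbers quoted in this review; the helper name `gains` is illustrative only. For ADE20K, for example, the absolute gain is 3.8 points, which is roughly a 6.8% relative improvement over the 55.9 baseline.

```python
def gains(before, after):
    """Return (absolute change, relative change in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


# Before/after values as quoted in the key findings above.
reported = [
    ("ADE20K mIoU (DINOv3)", 55.9, 59.7),
    ("ImageNet linear probing (DINOv3)", 88.4, 90.2),
    ("SQA3D BLEU-1 (OpenCLIP)", 51.4, 54.9),
    ("NYUd RMSE (SigLIPv2, lower is better)", 0.51, 0.39),
]

if __name__ == "__main__":
    for name, before, after in reported:
        absolute, relative = gains(before, after)
        print(f"{name}: {absolute:+.2f} absolute, {relative:+.1f}% relative")
```

For RMSE the change is negative because a lower error is an improvement.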

This review was created by AI and reviewed by human editors.