QUICK REVIEW

[Paper Review] SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning

Byungwoo Jeon, Dongyoung Kim | arXiv (Cornell University) | Mar 23, 2026

Multimodal Machine Learning Applications | 0 citations

TL;DR

SpatialBoost augments pre-trained vision encoders with 3D spatial knowledge by converting dense 3D cues into language-guided, multi-turn reasoning using an LLM, achieving consistent gains across 3D-aware and general vision tasks.

ABSTRACT

Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea involves converting dense 3D spatial information from 2D images into linguistic expressions, which is then used to inject such spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate effectiveness, we adapt SpatialBoost to state-of-the-art vision encoders such as DINOv3, and evaluate its performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 performance from 55.9 to 59.7 mIoU on ADE20K, achieving state-of-the-art performance with 3.8% gain over the pre-trained DINOv3.

Motivation & Objective

Close the gap in 3D spatial awareness of vision encoders trained predominantly on 2D images.
Inject dense 3D spatial knowledge via language descriptions using an LLM.
Preserve pre-trained knowledge while enabling spatial reasoning through a dual-channel attention mechanism.

Proposed method

Extract dense 3D spatial information from images (depth, 3D reconstruction, segmentation, region captions).
Convert spatial information into multi-turn, pixel- to scene-level reasoning with an LLM.
Align visual encoder features with the LLM embedding space via staged training (feature alignment, visual instruction tuning).
Fine-tune the vision encoder with a dual-channel attention module to inject spatial reasoning while preserving pre-trained knowledge.
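
The review does not spell out how the dual-channel attention module is built, so the following is only a minimal sketch of one way such a module could preserve pre-trained behavior: a frozen attention channel plus a new trainable channel, mixed through a zero-initialized gate. The gate, the single-head layout, and every name below (`attention`, `dual_channel_attention`, `gate`) are assumptions made for illustration, not the authors' implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(w, x):
    # Multiply matrix w (list of rows) by vector x.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over token vectors."""
    qs = [matvec(wq, t) for t in tokens]
    ks = [matvec(wk, t) for t in tokens]
    vs = [matvec(wv, t) for t in tokens]
    scale = 1.0 / math.sqrt(len(qs[0]))
    out = []
    for q in qs:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in ks]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, vs))
                    for d in range(len(vs[0]))])
    return out

def dual_channel_attention(tokens, frozen, trainable, gate):
    """Hypothetical dual-channel mix: frozen channel + tanh(gate) * new channel.

    With gate == 0.0 the output equals the frozen (pre-trained) channel,
    so fine-tuning starts from the original encoder's behavior.
    """
    base = attention(tokens, *frozen)      # pre-trained weights, kept fixed
    new = attention(tokens, *trainable)    # weights that would be fine-tuned
    g = math.tanh(gate)
    return [[b + g * n for b, n in zip(brow, nrow)]
            for brow, nrow in zip(base, new)]

# Toy example with two 2-dimensional tokens (all values are made up).
tokens = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
frozen = (identity, identity, identity)
spatial = [[0.5, 0.1], [0.2, 0.3]]
trainable = (spatial, spatial, spatial)

at_init = dual_channel_attention(tokens, frozen, trainable, gate=0.0)
after_training = dual_channel_attention(tokens, frozen, trainable, gate=1.0)
```

At `gate=0.0` the module reproduces the frozen channel exactly, which is the sense in which a design like this would avoid disturbing pre-trained knowledge at the start of fine-tuning. As the gate moves away from zero, the new channel contributes more.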

Figure 1: Overview of SpatialBoost. We enhance spatial and geometric understanding of pre-trained vision encoders by leveraging language-guided spatial reasoning. SpatialBoost consists of (a) spatial knowledge extraction through depth estimation, 3D reconstruction, segmentation, and region captioni

Experimental results

Research questions

RQ1: Does SpatialBoost improve the spatial understanding of pre-trained vision encoders across 2D and 3D tasks?
RQ2: Can language-guided, multi-turn spatial reasoning provide transferable gains without catastrophic forgetting?
RQ3: Which components (LLM decoding, multi-turn reasoning, dual-channel attention) contribute most to performance gains?

Key findings

SpatialBoost improves performance on 3D-centric tasks across encoders (e.g., DINOv3, SigLIPv2) and benchmarks.
On ADE20K, DINOv3 with SpatialBoost reaches 59.7% mIoU (up from 55.9%), a 3.8 percentage point gain.
ImageNet linear probing for DINOv3 rises from 88.4% to 90.2% with SpatialBoost.
In 3D scene understanding (Lexicon3D, SQA3D), BLEU-1 for OpenCLIP improves from 51.4 to 54.9 with SpatialBoost.
For depth estimation on NYUd with SigLIPv2 (linear probing), RMSE drops from 0.51 to 0.39 (lower is better).
The gains extend beyond 3D-centric tasks to general vision tasks such as image retrieval and classification, as the ImageNet result above shows.
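
The figures above mix accuracy-style metrics (higher is better) with an error metric (lower is better), and the abstract's "3.8% gain" is an absolute difference in mIoU points, not a relative one. The short sketch below recomputes both views from the numbers reported in this review. The helper name `gains` is hypothetical.

```python
def gains(before, after):
    """Return (absolute change, relative change in percent) for a metric."""
    absolute = after - before
    relative_pct = 100.0 * absolute / before
    return round(absolute, 2), round(relative_pct, 1)

# (before, after) pairs as reported in the key findings.
reported = {
    "ADE20K mIoU (DINOv3)": (55.9, 59.7),
    "ImageNet linear probing (DINOv3)": (88.4, 90.2),
    "SQA3D BLEU-1 (OpenCLIP)": (51.4, 54.9),
    "NYUd RMSE (SigLIPv2)": (0.51, 0.39),  # lower is better
}

for name, (before, after) in reported.items():
    absolute, relative_pct = gains(before, after)
    print(f"{name}: {absolute:+} absolute, {relative_pct:+}% relative")
```

For ADE20K this gives a 3.8-point absolute gain, or roughly 6.8% relative to the DINOv3 baseline. The NYUd result comes out negative because RMSE is an error, so the change of about -23.5% is an improvement.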

Figure 2: Illustration of multi-turn visual spatial reasoning dataset, exhibiting pixel-level, object-level, and scene-level reasoning QAs. At the pixel-level, the QA task queries the 3D positions of points (e.g., via depth estimation). At the object-level, it extracts spatial properties of obj
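
To make the pixel-, object-, and scene-level progression concrete, here is a minimal sketch of how extracted spatial cues might be turned into a multi-turn QA sample. The question templates, the `SceneObject` fields, and the depth values are all hypothetical. The paper's actual prompts and data format are not given in this review.

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    name: str
    center_xy: tuple  # pixel coordinates (x, y) of the object's center
    depth_m: float    # assumed metric depth at that pixel, e.g. from a depth estimator

def build_multi_turn_qa(objects):
    """Build (question, answer) turns going from pixel to object to scene level."""
    turns = []
    # Pixel level: query depth at individual points.
    for obj in objects:
        x, y = obj.center_xy
        turns.append((f"What is the depth at pixel ({x}, {y})?",
                      f"About {obj.depth_m:.1f} m."))
    # Object level: compare pairs of objects using the pixel-level facts.
    for i, a in enumerate(objects):
        for b in objects[i + 1:]:
            nearer, farther = (a, b) if a.depth_m <= b.depth_m else (b, a)
            gap = farther.depth_m - nearer.depth_m
            turns.append((f"Which is closer to the camera, the {a.name} or the {b.name}?",
                          f"The {nearer.name}, by about {gap:.1f} m."))
    # Scene level: summarize the overall layout.
    ordered = sorted(objects, key=lambda o: o.depth_m)
    turns.append(("List the objects from nearest to farthest.",
                  ", ".join(o.name for o in ordered) + "."))
    return turns

# Made-up scene with two objects.
scene = [SceneObject("chair", (120, 200), 2.0),
         SceneObject("table", (300, 220), 3.5)]
conversation = build_multi_turn_qa(scene)
for question, answer in conversation:
    print("Q:", question)
    print("A:", answer)
```

Each later turn reuses facts established in earlier turns, which is the sense in which a multi-turn format can build hierarchical spatial understanding step by step.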

This review was created by AI and reviewed by human editors.