QUICK REVIEW

[Paper Review] SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning

Byungwoo Jeon, Dongyoung Kim | arXiv (Cornell University) | 2026. 03. 23.

Multimodal Machine Learning Applications | Citations: 0

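One-Line Summary

SpatialBoost converts dense 3D information into language-based multi-turn reasoning and uses an LLM to strengthen the 3D spatial knowledge of pre-trained vision encoders, achieving consistent gains across both 3D perception tasks and general vision tasks.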

ABSTRACT

Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea involves converting dense 3D spatial information from 2D images into linguistic expressions, which is then used to inject such spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate effectiveness, we adapt SpatialBoost to state-of-the-art vision encoders such as DINOv3, and evaluate its performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 performance from 55.9 to 59.7 mIoU on ADE20K, achieving state-of-the-art performance with 3.8% gain over the pre-trained DINOv3.

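Motivation and Objectives

Identify and close the 3D spatial-awareness gap in vision encoders trained on 2D images.
Inject dense 3D spatial knowledge through linguistic descriptions, using an LLM.
Enable spatial reasoning through a dual-channel attention mechanism while preserving pre-trained knowledge.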

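Proposed Method

Extract dense 3D spatial information from images (depth, 3D reconstruction, segmentation, region captions).
Convert the spatial information, with an LLM, into multi-turn reasoning that proceeds from pixel level to scene level.
Align vision encoder features with the LLM embedding space through staged training (feature alignment, then visual instruction tuning).
Fine-tune the vision encoder with a dual-channel attention module to inject spatial reasoning while preserving pre-trained knowledge.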
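This review does not spell out how the dual-channel attention module is built. The sketch below shows one common way to add a trainable branch without disturbing pre-trained behavior, and it is an assumption rather than the paper's design: a frozen attention channel is summed with a new channel scaled by a gate that starts at zero, so the layer initially reproduces the frozen output exactly. The names `DualChannelAttention`, `frozen`, `spatial`, and `gate` are hypothetical, and random matrices stand in for real pre-trained weights.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over token rows x."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def rand_matrix(rng, rows, cols):
    """Random stand-in for a weight or feature matrix."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class DualChannelAttention:
    """Hypothetical two-branch attention: frozen channel + gated new channel."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        # Assumption: these would be the pre-trained (frozen) projections.
        self.frozen = tuple(rand_matrix(rng, dim, dim) for _ in range(3))
        # Assumption: these would be the newly trained spatial projections.
        self.spatial = tuple(rand_matrix(rng, dim, dim) for _ in range(3))
        # Zero gate: the new channel contributes nothing before training.
        self.gate = 0.0

    def __call__(self, x):
        base = self_attention(x, *self.frozen)
        extra = self_attention(x, *self.spatial)
        return [[b + self.gate * e for b, e in zip(b_row, e_row)]
                for b_row, e_row in zip(base, extra)]


layer = DualChannelAttention(dim=4)
tokens = rand_matrix(random.Random(1), 3, 4)  # 3 tokens, 4 features each
print(layer(tokens) == self_attention(tokens, *layer.frozen))
```

With the gate at zero the layer is identical to the frozen channel, which is one way to read "preserving pre-trained knowledge"; as the gate moves away from zero during fine-tuning, the second channel begins to contribute.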

Figure 1: Overview of SpatialBoost. We enhance spatial and geometric understanding of pre-trained vision encoders by leveraging language-guided spatial reasoning. SpatialBoost consists of (a) spatial knowledge extraction through depth estimation, 3D reconstruction, segmentation, and region captioni

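Experimental Results

Research Questions

RQ1: Does SpatialBoost improve the spatial understanding of pre-trained vision encoders across 2D and 3D tasks?
RQ2: Can language-guided multi-turn reasoning deliver transferable gains without catastrophic forgetting?
RQ3: Which component (LLM decoding, multi-turn reasoning, dual-channel attention) contributes most to the performance gains?

Key Results

SpatialBoost improves performance on 3D-centric tasks across encoders (DINOv3, SigLIPv2, and others) and benchmarks.
On ADE20K, DINOv3 with SpatialBoost reaches 59.7 mIoU, up 3.8 points from the 55.9 baseline.
DINOv3's ImageNet linear probing accuracy rises from 88.4% to 90.2% with SpatialBoost.
For 3D scene understanding, Lexicon3D SQA3D BLEU-1 improves from 51.4 to 54.9 with SpatialBoost (OpenCLIP).
On NYUd, SigLIPv2's depth estimation RMSE improves from 0.51 to 0.39 (linear probing).
SpatialBoost also brings broad gains on image retrieval and classification across tasks. For example, DINOv3's ImageNet Top-1 accuracy increases from 88.4% to 90.2%.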
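The reported numbers mix percentages, raw scores, and an error metric, so it can help to separate absolute change from relative change. The short script below recomputes both from the figures quoted in this review; the dictionary keys and the `gains` helper are illustrative names, not anything from the paper. Note that the ADE20K improvement is 3.8 points in absolute terms, which corresponds to roughly 6.8% relative to the baseline.

```python
# (before, after) pairs copied from the results listed in this review.
reported = {
    "ADE20K mIoU (DINOv3)": (55.9, 59.7),
    "ImageNet linear probing % (DINOv3)": (88.4, 90.2),
    "SQA3D BLEU-1 (OpenCLIP)": (51.4, 54.9),
    "NYUd RMSE (SigLIPv2, lower is better)": (0.51, 0.39),
}


def gains(before, after):
    """Return (absolute change, relative change in percent of the baseline)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 1)


for name, (before, after) in reported.items():
    absolute, relative = gains(before, after)
    print(f"{name}: {absolute:+} absolute, {relative:+}% relative")
```

For RMSE the sign is negative because the metric decreases; a negative change there is an improvement.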

Figure 2: Illustration of multi-turn visual spatial reasoning dataset, exhibiting pixel-level, object-level, and scene-level reasoning QAs. At the pixel-level, the QA task queries the 3D positions of points (e.g., via depth estimation). At the object-level, it extracts spatial properties of obj
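To make the pixel-to-object-to-scene progression concrete, here is a minimal sketch of how per-object depth estimates could be turned into a multi-turn question-answer sequence. Everything in it is an assumption for illustration: the `Obj` record, the question templates, and the depth values are invented, and the paper's actual dataset construction is not described in this review.

```python
from dataclasses import dataclass


@dataclass
class Obj:
    """Hypothetical record for one detected object."""
    name: str
    pixel: tuple      # (u, v) pixel inside the object
    depth_m: float    # assumed depth estimate in metres


def pixel_turn(obj):
    """Pixel level: ask about the depth of a single point."""
    question = f"How far is the point at pixel {obj.pixel} from the camera?"
    answer = f"About {obj.depth_m:.1f} m."
    return question, answer


def object_turn(a, b):
    """Object level: compare two objects using their depths."""
    nearer, farther = (a, b) if a.depth_m <= b.depth_m else (b, a)
    question = f"Which is closer to the camera, the {a.name} or the {b.name}?"
    answer = (f"The {nearer.name} "
              f"({nearer.depth_m:.1f} m vs {farther.depth_m:.1f} m).")
    return question, answer


def scene_turn(objs):
    """Scene level: summarise the depth ordering of all objects."""
    ordered = sorted(objs, key=lambda o: o.depth_m)
    question = "List the objects from nearest to farthest."
    answer = ", ".join(o.name for o in ordered) + "."
    return question, answer


def build_dialogue(objs):
    """Chain the turns so later answers build on earlier ones."""
    turns = [pixel_turn(o) for o in objs]
    turns += [object_turn(objs[i], objs[i + 1]) for i in range(len(objs) - 1)]
    turns.append(scene_turn(objs))
    return turns


objects = [Obj("chair", (120, 200), 2.4), Obj("lamp", (300, 80), 4.1)]
for question, answer in build_dialogue(objects):
    print("Q:", question)
    print("A:", answer)
```

The ordering matters in this sketch: the object-level comparison reuses the depths stated in the pixel-level turns, and the scene-level answer follows from the pairwise comparisons, which mirrors the hierarchical structure the caption describes.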


This review was generated by AI and reviewed by a human editor.