QUICK REVIEW

[Paper Review] SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning

Byungwoo Jeon, Dongyoung Kim|arXiv (Cornell University)|Mar 23, 2026
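Multimodal Machine Learning Applications | Citations: 0
One-line summary

SpatialBoost strengthens pre-trained vision encoders with 3D spatial knowledge by converting dense 3D cues into language-guided, multi-turn reasoning, and achieves consistent gains on both 3D-aware tasks and general vision tasks.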

ABSTRACT

Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea involves converting dense 3D spatial information from 2D images into linguistic expressions, which is then used to inject such spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate effectiveness, we adapt SpatialBoost to state-of-the-art vision encoders such as DINOv3, and evaluate its performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 performance from 55.9 to 59.7 mIoU on ADE20K, achieving state-of-the-art performance with 3.8% gain over the pre-trained DINOv3.

Motivation and objectives

  • Address the gap in 3D spatial awareness of vision encoders trained on 2D images.
  • Inject dense 3D spatial knowledge via language descriptions using an LLM.
  • Preserve pre-trained knowledge while enabling spatial reasoning through a dual-channel attention mechanism.

Proposed method

  • Extract dense 3D spatial information from images (depth, 3D reconstruction, segmentation, region captions).
  • Convert spatial information into multi-turn, pixel- to scene-level reasoning with an LLM.
  • Align visual encoder features with the LLM embedding space via staged training (feature alignment, visual instruction tuning).
  • Fine-tune the vision encoder with a dual-channel attention module to inject spatial reasoning while preserving pre-trained knowledge.
Figure 1: Overview of SpatialBoost. We enhance spatial and geometric understanding of pre-trained vision encoders by leveraging language-guided spatial reasoning. SpatialBoost consists of (a) spatial knowledge extraction through depth estimation, 3D reconstruction, segmentation, and region captioni
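
The last step above, fine-tuning with a dual-channel attention module while preserving pre-trained knowledge, can be illustrated with a minimal sketch. The review does not say how the two channels are combined, so everything here is an assumption for illustration: a frozen channel holding the pre-trained attention weights, a second trainable channel, and a scalar `gate` passed through `tanh` and initialized to zero so that the module initially reproduces the pre-trained output exactly. The function names and the gating scheme are hypothetical and may differ from the paper's actual design.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the tokens in x."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]          # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def dual_channel_attention(x, frozen, trainable, gate):
    """Assumed combination rule: frozen channel + tanh(gate) * trainable channel."""
    base = attention(x, *frozen)        # pre-trained weights, kept fixed
    extra = attention(x, *trainable)    # new weights, updated during fine-tuning
    g = math.tanh(gate)                 # gate = 0.0 leaves the base output unchanged
    return [[b + g * e for b, e in zip(rb, re)] for rb, re in zip(base, extra)]

# Toy weights and tokens (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
frozen = (identity, identity, identity)
trainable = ([[0.5, 0.1], [0.2, 0.3]],) * 3
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

at_init = dual_channel_attention(tokens, frozen, trainable, gate=0.0)
after_training = dual_channel_attention(tokens, frozen, trainable, gate=1.0)
```

With `gate=0.0` the output equals that of the frozen channel alone, which is one simple way to start fine-tuning from the pre-trained behaviour; as the gate moves away from zero, the trainable channel contributes more.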

Experimental results

Research questions

  • RQ1: Does SpatialBoost improve the spatial understanding of pre-trained vision encoders across 2D and 3D tasks?
  • RQ2: Can language-guided, multi-turn spatial reasoning provide transferable gains without catastrophic forgetting?
  • RQ3: Which components (LLM decoding, multi-turn reasoning, dual-channel attention) contribute most to performance gains?

Key findings

  • SpatialBoost improves performance on 3D-centric tasks across encoders (e.g., DINOv3, SigLIPv2) and benchmarks.
  • On ADE20K, DINOv3 with SpatialBoost reaches 59.7 mIoU, up 3.8 points from 55.9.
  • ImageNet linear probing accuracy for DINOv3 rises from 88.4% to 90.2% with SpatialBoost.
  • For 3D scene understanding, Lexicon3D SQA3D BLEU-1 improves from 51.4 to 54.9 with SpatialBoost (OpenCLIP).
  • For depth estimation on NYUd, the RMSE of SigLIPv2 drops from 0.51 to 0.39 (linear probing).
  • Across tasks, SpatialBoost also yields broad improvements in image retrieval and classification; for example, DINOv3's ImageNet top-1 accuracy rises from 88.4% to 90.2%.
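
The figures above mix absolute differences with percentages, so it can help to compute both explicitly. The short sketch below uses only the numbers reported in this review; the helper name `gain` is hypothetical. Note that the "3.8% gain" on ADE20K quoted in the abstract is an absolute difference in mIoU points, not a relative improvement.

```python
def gain(before, after):
    """Return (absolute change, relative change in percent) between two scores."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative

# (before, after) pairs as reported in the findings above.
results = {
    "ADE20K mIoU (DINOv3)": (55.9, 59.7),
    "ImageNet linear probing (DINOv3)": (88.4, 90.2),
    "SQA3D BLEU-1 (OpenCLIP)": (51.4, 54.9),
    "NYUd RMSE (SigLIPv2, lower is better)": (0.51, 0.39),
}

for name, (before, after) in results.items():
    absolute, relative = gain(before, after)
    print(f"{name}: {absolute:+.2f} absolute, {relative:+.1f}% relative")
```

For ADE20K this gives a change of 3.8 points, or roughly 6.8% relative to the 55.9 baseline. For RMSE a negative change is the improvement, since lower is better.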
Figure 2: Illustration of multi-turn visual spatial reasoning dataset, exhibiting pixel-level, object-level, and scene-level reasoning QAs. At the pixel-level, the QA task queries the 3D positions of points (e.g., via depth estimation). At the object-level, it extracts spatial properties of obj
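
The pixel-to-object-to-scene progression described in the caption can be sketched as a small dialogue builder. This is an illustration under stated assumptions, not the paper's pipeline: the depth map is a plain 2D list in metres, objects are given as axis-aligned boxes standing in for segmentation masks, and the question templates and the function name `build_dialogue` are hypothetical.

```python
from statistics import median

def build_dialogue(depth, objects):
    """Build pixel-, object-, and scene-level QA turns.

    depth:   2D list of depths in metres, indexed as depth[y][x] (assumed format)
    objects: {name: (x0, y0, x1, y1)} boxes, end-exclusive (stand-in for masks)
    """
    turns = []
    object_depth = {}

    # Pixel level: ask for the depth at the centre point of each box.
    for name, (x0, y0, x1, y1) in objects.items():
        cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
        turns.append({
            "level": "pixel",
            "question": f"How far is the point ({cx}, {cy}) from the camera?",
            "answer": f"About {depth[cy][cx]:.1f} m.",
        })
        region = [depth[y][x] for y in range(y0, y1) for x in range(x0, x1)]
        object_depth[name] = median(region)

    # Object level: summarise each object's distance from its region.
    for name, d in object_depth.items():
        turns.append({
            "level": "object",
            "question": f"How far is the {name} from the camera?",
            "answer": f"About {d:.1f} m.",
        })

    # Scene level: relate all objects to each other.
    order = sorted(object_depth, key=object_depth.get)
    turns.append({
        "level": "scene",
        "question": "List the objects from nearest to farthest.",
        "answer": ", ".join(order) + ".",
    })
    return turns

# Toy 4x4 depth map and two boxes (hypothetical values).
depth = [
    [1.0, 1.0, 3.0, 3.0],
    [1.0, 1.0, 3.0, 3.0],
    [2.0, 2.0, 3.0, 3.0],
    [2.0, 2.0, 3.0, 3.0],
]
objects = {"chair": (0, 0, 2, 2), "table": (2, 0, 4, 4)}
dialogue = build_dialogue(depth, objects)
```

Each later turn builds on quantities established in earlier ones, which mirrors the hierarchical ordering the caption describes; a real pipeline would draw these quantities from depth estimation, 3D reconstruction, segmentation, and region captions rather than from hand-written boxes.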

This review was generated by AI and checked by a human editor.