QUICK REVIEW

[Paper Review] SPACE-CLIP: Spatial Perception via Adaptive CLIP Embeddings for Monocular Depth Estimation

Taewan Cho, Taeryang Kim | arXiv (Cornell University) | Jan 25, 2026

Multimodal Machine Learning Applications | Citations: 0

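One-Sentence Summary

SPACE-CLIP directly interprets the latent geometric knowledge in a frozen CLIP vision encoder and uses a dual-pathway decoder to perform monocular depth estimation without text prompts or encoder fine-tuning.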

ABSTRACT

Robotic and autonomous systems need dense spatial cues, but many monocular depth models are heavy, task-specific, or hard to attach to an existing multimodal stack. CLIP offers strong semantic representations, yet most CLIP-based depth methods still depend on text prompts or backbone updates, which complicate deployment in integrated control pipelines. We present SPACE-CLIP, a decoder-only depth framework that reads geometric cues directly from a frozen CLIP vision encoder and bypasses the text encoder at inference time. The model combines FiLM-conditioned semantic features from deep layers with structural features from shallow layers to recover both global scene layout and local geometric detail. Under the TFI-FB constraint (text-free inference and frozen vision backbone), SPACE-CLIP achieves AbsRel 0.0901 on KITTI and 0.1042 on NYU Depth V2, and the same dual-pathway decoder transfers to a frozen SigLIP backbone with comparable results. These findings show that a compact decoder can turn a shared foundation-model backbone into a reusable spatial perception module for embodied AI and autonomous robotic systems. Our model is available at https://github.com/taewan2002/space-clip

Research Motivation and Objectives

Enable monocular depth estimation by directly interpreting latent geometry from a frozen vision encoder (CLIP) without using the text encoder.
Develop a lightweight, integrable depth perception module suitable as a plugin for VLA-like embodied AI systems.
Propose a dual-pathway Dense Predictor that fuses semantic and structural information hierarchically.

Proposed Method

Use a frozen CLIP ViT-B/16 vision encoder to extract multi-level features.
Introduce a Dense Predictor with a Semantic Pathway (high-level features with FiLM conditioning) and a Structural Pathway (low-level features).
FiLM modulates semantic features using global context from the [CLS] token via an MLP-based FiLM generator.
Hierarchically fuse semantic and structural streams at each upsampling stage to produce a high-resolution depth map.
Train with a composite loss combining Scale-Invariant Logarithmic (SILog) loss and Structural Similarity (SSIM) loss (lambda_ssim = 0.5).
Evaluate on the KITTI Eigen split with a 224x224 CLIP input and a 352x704 processing resolution.
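
As a rough illustration of two steps listed above, the sketch below shows FiLM modulation driven by a [CLS] vector and the SILog term of the training loss, in plain Python. It is a minimal sketch under assumptions, not the authors' implementation: the function names, the two-layer ReLU generator, the layer shapes, and the `variance_weight` default of 0.5 are hypothetical choices that this review does not specify, and the SSIM term of the composite loss is omitted.

```python
import math


def linear(x, weight, bias):
    """Dense layer; weight is a list of rows with shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def film_generator(cls_token, w1, b1, w2, b2):
    """Hypothetical two-layer MLP mapping the [CLS] vector to (gamma, beta)."""
    hidden = [max(0.0, h) for h in linear(cls_token, w1, b1)]  # ReLU
    out = linear(hidden, w2, b2)
    channels = len(out) // 2  # first half is gamma, second half is beta
    return out[:channels], out[channels:]


def film(features, gamma, beta):
    """Apply gamma * x + beta per channel; features has shape (tokens, channels)."""
    return [[g * x + b for x, g, b in zip(token, gamma, beta)] for token in features]


def silog_loss(pred, target, variance_weight=0.5):
    """Scale-invariant log loss: mean(d^2) - variance_weight * mean(d)^2,
    where d = log(pred) - log(target). Depths must be strictly positive.
    The variance_weight value is an assumption, not taken from the paper."""
    d = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(d)
    return sum(v * v for v in d) / n - variance_weight * (sum(d) / n) ** 2
```

Two sanity checks follow from the definitions: if the generator outputs gamma = 1 and beta = 0, `film` returns its input unchanged, and with `variance_weight=1.0` the SILog term is (up to rounding) zero for any prediction that differs from the target only by a constant scale factor.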

Experimental Results

Research Questions

RQ1: Can latent geometric knowledge in a frozen vision encoder be accessed directly for dense prediction tasks without relying on the text encoder?
RQ2: Is a dual-pathway decoder capable of jointly preserving fine structural details and high-level semantic context to produce accurate depth maps?
RQ3: How does hierarchical fusion of semantic and structural streams affect monocular depth estimation performance under no-tuning constraints?

Key Findings

SPACE-CLIP achieves competitive depth estimation under a strict no-text, no-finetune constraint, outperforming earlier CLIP-based methods.
The ablation shows that adding the Structural Pathway lowers AbsRel from 0.1165 to 0.1094 (about 6% relative), an improvement attributed to preserving fine-grained details.
FiLM conditioning provides additional gains by injecting global context into high-level semantic features.
The full SPACE-CLIP model (FiLM + Structural Pathway) achieves the best metrics among the ablation variants (AbsRel 0.1038, RMSE 4.837).
The method demonstrates the feasibility of using frozen foundation-model features as a modular perception plugin for embodied AI.
On the KITTI Eigen split, SPACE-CLIP reduces AbsRel from the 0.307 reported for Auty et al. to 0.104.
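
For readers unfamiliar with the AbsRel and RMSE figures quoted above, the sketch below computes both with their standard definitions in plain Python. The function names and the rule of skipping pixels whose ground-truth depth is not positive are assumptions for illustration; this review does not describe the paper's exact evaluation protocol (for example, depth caps or crops).

```python
import math


def _valid_pairs(pred, target):
    """Keep only pixels with positive ground-truth depth (assumed validity rule)."""
    return [(p, t) for p, t in zip(pred, target) if t > 0]


def abs_rel(pred, target):
    """Absolute relative error: mean(|pred - target| / target) over valid pixels."""
    pairs = _valid_pairs(pred, target)
    return sum(abs(p - t) / t for p, t in pairs) / len(pairs)


def rmse(pred, target):
    """Root mean squared error over valid pixels, in the depth unit of the inputs."""
    pairs = _valid_pairs(pred, target)
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))
```

Lower is better for both metrics. AbsRel is unitless because each error is divided by the true depth, while RMSE is in the same unit as the depth values.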

This review was generated by AI and checked by a human editor.