QUICK REVIEW

[Paper Review] SPACE-CLIP: Spatial Perception via Adaptive CLIP Embeddings for Monocular Depth Estimation

Taewan Cho, Taeryang Kim | arXiv (Cornell University) | 2026-01-25

Multimodal Machine Learning Applications | Citations: 0

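One-Line Summary

SPACE-CLIP uses a dual-pathway decoder that reads latent geometric knowledge directly from a frozen CLIP vision encoder, performing monocular depth estimation without text prompts or encoder fine-tuning.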

ABSTRACT

Robotic and autonomous systems need dense spatial cues, but many monocular depth models are heavy, task-specific, or hard to attach to an existing multimodal stack. CLIP offers strong semantic representations, yet most CLIP-based depth methods still depend on text prompts or backbone updates, which complicate deployment in integrated control pipelines. We present SPACE-CLIP, a decoder-only depth framework that reads geometric cues directly from a frozen CLIP vision encoder and bypasses the text encoder at inference time. The model combines FiLM-conditioned semantic features from deep layers with structural features from shallow layers to recover both global scene layout and local geometric detail. Under the TFI-FB constraint (text-free inference and frozen vision backbone), SPACE-CLIP achieves AbsRel 0.0901 on KITTI and 0.1042 on NYU Depth V2, and the same dual-pathway decoder transfers to a frozen SigLIP backbone with comparable results. These findings show that a compact decoder can turn a shared foundation-model backbone into a reusable spatial perception module for embodied AI and autonomous robotic systems. Our model is available at https://github.com/taewan2002/space-clip

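Research Motivation and Goals

Enable monocular depth estimation by reading latent geometry directly from a frozen vision encoder (CLIP), without using the text encoder.
Develop a lightweight, easily integrated depth perception module suited to serve as a plug-in for embodied AI systems such as vision-language-action (VLA) models.
Propose a dual-pathway Dense Predictor that hierarchically fuses semantic and structural information.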

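Proposed Method

A frozen CLIP ViT-B/16 vision encoder is used to extract multi-level features.
A Dense Predictor is introduced with two pathways: a semantic pathway that carries FiLM-modulated high-level features and a structural pathway that carries low-level features.
FiLM modulates the semantic features using the global context of the [CLS] token, via an MLP-based FiLM generator.
The semantic and structural streams are fused hierarchically at each upsampling stage to produce a high-resolution depth map.
Training uses a composite loss that combines the Scale-Invariant Logarithmic (SILog) loss and the Structural Similarity (SSIM) loss (lambda_ssim = 0.5).
Evaluation is on the KITTI Eigen split, with 224x224 CLIP input and a 352x704 processing resolution.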
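
Two of the steps listed above can be illustrated with a short sketch: the FiLM modulation of semantic features and the SILog term of the training loss. This is not the authors' code. It uses plain Python lists rather than tensors, and several details are assumptions the review does not state: the two-layer shape of the MLP (`film_generator` and its `params` are hypothetical), the plain `gamma * x + beta` form of FiLM, and the `variance_focus` weight inside SILog (0.85 is a common choice in other depth work). The SSIM term is omitted; presumably it would be added as `silog + 0.5 * ssim_loss`, but the exact combination is also an assumption.

```python
import math

def linear(x, weight, bias):
    """Dense layer: weight is [out][in], x is a vector of length `in`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def film_generator(cls_token, params):
    """Hypothetical two-layer MLP mapping the [CLS] vector to (gamma, beta)."""
    hidden = [max(0.0, h) for h in linear(cls_token, params["w1"], params["b1"])]
    out = linear(hidden, params["w2"], params["b2"])
    channels = len(out) // 2  # first half scales, second half shifts
    return out[:channels], out[channels:]

def film_modulate(features, gamma, beta):
    """Apply gamma_c * x + beta_c to every value of channel c.

    `features` is laid out as [channels][positions].
    """
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]

def silog_loss(pred, target, variance_focus=0.85):
    """Scale-invariant log loss over pixels with positive ground truth.

    Predictions are assumed to be strictly positive depths.
    """
    diffs = [math.log(p) - math.log(t) for p, t in zip(pred, target) if t > 0]
    mean_sq = sum(d * d for d in diffs) / len(diffs)
    mean = sum(diffs) / len(diffs)
    # Clamp at zero so rounding error cannot make the radicand negative.
    return math.sqrt(max(0.0, mean_sq - variance_focus * mean * mean))

# Toy example: 2-dim [CLS] vector, 2 feature channels, 2 positions each.
params = {
    "w1": [[1.0, 0.0], [0.0, 1.0]], "b1": [0.0, 0.0],
    "w2": [[2.0, 0.0], [0.0, 3.0], [1.0, 0.0], [0.0, 1.0]],
    "b2": [0.0, 1.0, 0.0, 0.5],
}
gamma, beta = film_generator([1.0, -1.0], params)
modulated = film_modulate([[1.0, 2.0], [3.0, 4.0]], gamma, beta)
print(gamma, beta, modulated)
print(silog_loss([2.0, 4.0], [1.0, 2.0]))
```

In the last call the prediction is exactly twice the ground truth everywhere. With `variance_focus=1.0` such a global scale error would cost nothing, which is the sense in which the loss is scale-invariant; at 0.85 a small penalty remains.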

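Experimental Results

Research Questions

RQ1: Can the latent geometric knowledge of a frozen vision encoder be accessed directly for dense prediction tasks, without relying on a text encoder?
RQ2: Can a dual-pathway decoder preserve both fine structural detail and high-level semantic context to produce accurate depth maps?
RQ3: How does hierarchical fusion of the semantic and structural streams affect monocular depth estimation performance without fine-tuning?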

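Key Results

SPACE-CLIP achieves competitive depth estimation under the strict text-free, no-fine-tuning constraint and outperforms previous CLIP-based methods.
In the ablation study, the structural pathway yields a substantial improvement by preserving fine detail (AbsRel drops from 0.1165 to 0.1094).
FiLM modulation provides a further gain by injecting global context into the high-level semantic features.
The full SPACE-CLIP model (FiLM + Structural Pathway) achieves the best metrics among the ablated configurations (AbsRel 0.1038, RMSE 4.837).
The method demonstrates the feasibility of using frozen foundation-model features as a modular perception plug-in for embodied AI.
On the KITTI Eigen split, SPACE-CLIP improves substantially over Auty et al. in AbsRel (from 0.307 to 0.104).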
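
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how AbsRel and RMSE are conventionally computed for depth maps. The function name `depth_metrics` and the flat-list layout are illustrative assumptions, and the sketch leaves out benchmark-specific steps such as depth caps or evaluation crops, which the review does not describe.

```python
import math

def depth_metrics(pred, target):
    """Return (AbsRel, RMSE) over pixels whose ground-truth depth is positive."""
    pairs = [(p, t) for p, t in zip(pred, target) if t > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    # AbsRel: mean of |pred - gt| / gt, so errors are relative to true depth.
    abs_rel = sum(abs(p - t) / t for p, t in pairs) / len(pairs)
    # RMSE: root of the mean squared error, in the same units as the depths.
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))
    return abs_rel, rmse

# The third pixel has no ground truth (0.0) and is ignored.
abs_rel, rmse = depth_metrics([2.0, 4.0, 9.0], [2.0, 5.0, 0.0])
print(f"AbsRel={abs_rel:.4f} RMSE={rmse:.4f}")
```

Because AbsRel divides by the true depth, it is unitless and weights near-range errors more heavily, while RMSE is dominated by large absolute errors at long range. That is why reviews typically report both.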

This review was generated by AI and reviewed by a human editor.