QUICK REVIEW

[Paper Review] Towards Label-free Scene Understanding by Vision Foundation Models

Runnan Chen, Youquan Liu|arXiv (Cornell University)|2023. 06. 06.
Multimodal Machine Learning Applications | Citations: 15
One-line summary

The paper introduces Cross-modality Noisy Supervision (CNS) to enable 2D and 3D label-free semantic segmentation by leveraging CLIP and SAM, with empirical gains on ScanNet, nuImages, and nuScenes.

ABSTRACT

Vision foundation models such as Contrastive Vision-Language Pre-training (CLIP) and Segment Anything (SAM) have demonstrated impressive zero-shot performance on image classification and segmentation tasks. However, the incorporation of CLIP and SAM for label-free scene understanding has yet to be explored. In this paper, we investigate the potential of vision foundation models in enabling networks to comprehend 2D and 3D worlds without labelled data. The primary challenge lies in effectively supervising networks under extremely noisy pseudo labels, which are generated by CLIP and further exacerbated during the propagation from the 2D to the 3D domain. To tackle these challenges, we propose a novel Cross-modality Noisy Supervision (CNS) method that leverages the strengths of CLIP and SAM to supervise 2D and 3D networks simultaneously. In particular, we introduce a prediction consistency regularization to co-train 2D and 3D networks, then further impose the networks' latent space consistency using the SAM's robust feature representation. Experiments conducted on diverse indoor and outdoor datasets demonstrate the superior performance of our method in understanding 2D and 3D open environments. Our 2D and 3D network achieves label-free semantic segmentation with 28.4% and 33.5% mIoU on ScanNet, improving 4.7% and 7.9%, respectively. For nuImages and nuScenes datasets, the performance is 22.1% and 26.8% with improvements of 3.5% and 6.0%, respectively. Code is available at https://github.com/runnanchen/Label-Free-Scene-Understanding.

Research motivation and goals

  • Motivate label-free scene understanding for 2D and 3D in open-world environments.
  • Leverage vision foundation models CLIP (classification) and SAM (segmentation) to generate noisy pseudo labels.
  • Develop a joint framework to supervise 2D and 3D networks despite label noise.
  • Use SAM-derived latent-space regularization to align and stabilize multi-modal representations.
  • Demonstrate state-of-the-art label-free segmentation on indoor and outdoor datasets.

Proposed method

  • Pseudo-label 2D pixels with CLIP and transfer the labels to 3D points via the camera calibration matrix.
  • Refine CLIP-derived pseudo-labels with SAM masks to improve supervision quality.
  • Train 2D and 3D networks with prediction consistency regularization by randomly switching pseudo-labels across modalities.
  • Impose latent-space consistency by aligning 2D/3D features with frozen SAM feature space using a cosine-similarity loss (L_f).
  • Two-stage training: stage one trains with refined labels; stage two introduces self- and cross-training using multiple pseudo-label sources.
  • Backbones: MinkowskiNet34 for 3D and DeepLabV3 for 2D; CLIP's attention pooling is modified for dense pixel labeling.
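
The steps listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: every function name is hypothetical, plain boolean arrays stand in for SAM masks, and three details are assumptions the summary does not specify (majority voting as the refinement rule, 1 - cosine similarity as the form of L_f, and switching the supervision source once per call rather than per pixel or point).

```python
import numpy as np

def transfer_labels_2d_to_3d(points, proj, pixel_labels, ignore=-1):
    """Give each 3D point the pseudo-label of the pixel it projects onto.

    points: (N, 3) coordinates; proj: (3, 4) projection (calibration) matrix;
    pixel_labels: (H, W) integer class map. Points behind the camera or
    outside the image receive `ignore`.
    """
    h, w = pixel_labels.shape
    n = len(points)
    homog = np.hstack([points, np.ones((n, 1))])   # (N, 4) homogeneous coords
    uvz = homog @ proj.T                           # (N, 3) image-plane coords
    z = uvz[:, 2]
    front = z > 1e-6                               # keep points in front of the camera
    u = np.zeros(n, dtype=np.int64)
    v = np.zeros(n, dtype=np.int64)
    u[front] = np.floor(uvz[front, 0] / z[front]).astype(np.int64)
    v[front] = np.floor(uvz[front, 1] / z[front]).astype(np.int64)
    valid = front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    labels = np.full(n, ignore, dtype=np.int64)
    labels[valid] = pixel_labels[v[valid], u[valid]]
    return labels

def refine_with_masks(pixel_labels, masks, ignore=-1):
    """Assumed refinement rule: majority vote of labels inside each mask."""
    refined = pixel_labels.copy()
    for mask in masks:                             # each mask: (H, W) boolean array
        votes = pixel_labels[mask]
        votes = votes[votes != ignore]
        if votes.size:
            refined[mask] = np.bincount(votes).argmax()
    return refined

def pick_supervision(sources, rng):
    """Assumed switching rule: choose one pseudo-label source at random."""
    return sources[rng.integers(len(sources))]

def cosine_feature_loss(feats, target_feats, eps=1e-8):
    """Mean (1 - cosine similarity) between matching rows of two feature sets."""
    a = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + eps)
    b = target_feats / (np.linalg.norm(target_feats, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(a * b, axis=1)))
```

In a training loop, the array returned by `pick_supervision` would serve as the target for a segmentation loss, while `cosine_feature_loss` would compare network features against features from the frozen SAM encoder; both uses are only outlined here.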

Experimental results

Research questions

  • RQ1: Can vision foundation models enable open-world, label-free 2D and 3D scene understanding?
  • RQ2: How can CLIP and SAM be combined so that noisy pseudo labels still provide robust cross-modal supervision?
  • RQ3: Does co-training 2D and 3D networks with switched pseudo-labels mitigate error propagation from label noise?
  • RQ4: Can latent-space alignment with SAM features improve segmentation boundaries in label-free settings?
  • RQ5: How does the proposed CNS framework perform on indoor (ScanNet) and outdoor (nuScenes, nuImages) datasets without labeled data?

Key results

  • The proposed CNS framework achieves label-free semantic segmentation of 2D and 3D data, outperforming prior label-free methods on ScanNet and nuScenes.
  • On ScanNet, the 2D and 3D networks reach 28.4% and 33.5% mIoU, respectively, improving on prior methods by 4.7% and 7.9%.
  • On nuImages (2D) and nuScenes (3D), the networks reach 22.1% and 26.8% mIoU, respectively, improving on baselines by 3.5% and 6.0%.
  • Ablation studies show SAM-based label refinement, prediction consistency regularization, and latent-space consistency with SAM features are critical for performance.
  • Full configuration (CNS with all components) yields the best label-free 2D/3D segmentation results across evaluated datasets.
  • Qualitative results demonstrate the method’s ability to segment many open-world objects without labels, approaching human-like performance in several cases.
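
For reference, the mIoU figures quoted above are mean intersection-over-union scores. The sketch below shows one common way to compute the metric from predicted and ground-truth label arrays. The function name `mean_iou` is hypothetical, and averaging only over classes that appear in the prediction or the ground truth is an assumption; benchmark protocols, including the one used in the paper, may treat absent classes differently.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore=-1):
    """Mean IoU over classes present in the prediction or the ground truth.

    pred, target: integer label arrays of equal shape, with class ids in
    [0, num_classes). Positions where target == ignore are skipped.
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    keep = target != ignore
    pred, target = pred[keep], target[keep]
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    present = union > 0
    if not present.any():
        return 0.0
    return float((inter[present] / union[present]).mean())
```

With predictions [0, 0, 1, 1] and ground truth [0, 1, 1, 1], class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.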

This review was generated by AI and reviewed by a human editor.