QUICK REVIEW

[Paper Review] Segment Anything Meets Point Tracking

Frano Rajič, Lei Ke | arXiv (Cornell University) | Jul 3, 2023
Visual Attention and Saliency Detection | 38 citations
TL;DR

SAM-PT combines the Segment Anything Model (SAM) with long-term point tracking to enable zero-shot interactive video segmentation from sparse query points, achieving strong results on multiple VOS/VIS benchmarks without training on any video segmentation data.

ABSTRACT

The Segment Anything Model (SAM) has established itself as a powerful zero-shot image segmentation model, enabled by efficient point-centric annotation and prompt-based models. While click and brush interactions are both well explored in interactive image segmentation, the existing methods on videos focus on mask annotation and propagation. This paper presents SAM-PT, a novel method for point-centric interactive video segmentation, empowered by SAM and long-term point tracking. SAM-PT leverages robust and sparse point selection and propagation techniques for mask generation. Compared to traditional object-centric mask propagation strategies, we uniquely use point propagation to exploit local structure information agnostic to object semantics. We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark. Our experiments on popular video object segmentation and multi-object segmentation tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and efficient interactions. We release our code that integrates different point trackers and video segmentation benchmarks at https://github.com/SysCV/sam-pt.

Motivation & Objective

  • Bring zero-shot interactive segmentation to video by pairing a foundation image segmentation model (SAM) with sparse point prompts.
  • Develop a point-centric propagation framework that tracks query points through video frames to guide segmentation.
  • Enable mask refinement and occasional reinitialization to maintain accuracy over long video sequences.
  • Evaluate SAM-PT across semi-supervised, open-world, fully interactive VOS, and VIS settings on diverse benchmarks.
  • Highlight the practical benefits for interactive annotation and the zero-shot generalization achieved without video segmentation training data.

Proposed method

  • Extend SAM with long-term point trackers (e.g., PIPS, CoTracker) to propagate positive and negative query points across frames.
  • Sample initial positive and negative points from the first frame using K-Medoids, Shi-Tomasi, random, or mixed sampling; the ablations favor eight positive points per object.
  • Prompt SAM in two passes per frame: first with only positive points to localize the object, then with both positive and negative points plus the previous mask for refinement.
  • Reinitialize query points every horizon (h = 8 frames) by sampling new points from the latest predicted mask to recover from tracking errors and occlusions.
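The four steps above can be sketched as a single loop. This is a minimal illustration under stated assumptions, not the SAM-PT implementation: `track` and `prompt` are hypothetical callables standing in for a point tracker and for SAM, only the random sampling strategy is shown, the default of one negative point is an arbitrary choice, and the "previous mask" in the second pass is read as the first-pass output.

```python
import random
from typing import Callable, List, Optional, Tuple

Point = Tuple[int, int]   # (row, col)
Mask = List[List[int]]    # binary mask, 1 = object pixel

# Hypothetical interfaces, NOT the real SAM or PIPS/CoTracker APIs.
Tracker = Callable[[List[Point], int], List[Point]]
Prompter = Callable[[int, List[Point], List[Point], Optional[Mask]], Mask]


def sample_points(mask: Mask, k: int, rng: random.Random, value: int = 1) -> List[Point]:
    """Randomly pick up to k pixels whose mask entry equals `value`."""
    pool = [(r, c) for r, row in enumerate(mask)
            for c, v in enumerate(row) if v == value]
    return rng.sample(pool, min(k, len(pool)))


def two_pass(prompt: Prompter, t: int, pos: List[Point], neg: List[Point]) -> Mask:
    """Pass 1 uses positive points only; pass 2 adds negatives and the pass-1 mask."""
    coarse = prompt(t, pos, [], None)
    return prompt(t, pos, neg, coarse)


def propagate(first_mask: Mask, num_frames: int, track: Tracker, prompt: Prompter,
              k_pos: int = 8, k_neg: int = 1, horizon: int = 8, seed: int = 0) -> List[Mask]:
    """Track points frame by frame, prompt per frame, re-sample every `horizon` frames."""
    rng = random.Random(seed)
    pos = sample_points(first_mask, k_pos, rng, value=1)
    neg = sample_points(first_mask, k_neg, rng, value=0)
    masks = [first_mask]
    for t in range(1, num_frames):
        pos, neg = track(pos, t), track(neg, t)
        masks.append(two_pass(prompt, t, pos, neg))
        if t % horizon == 0:
            fresh = sample_points(masks[-1], k_pos, rng, value=1)
            if fresh:  # keep the old points if the predicted mask is empty
                pos = fresh
                neg = sample_points(masks[-1], k_neg, rng, value=0)
    return masks


# Toy stand-ins so the sketch runs end to end on a static 4x4 scene.
def identity_tracker(points: List[Point], t: int) -> List[Point]:
    return list(points)  # pretend nothing moves between frames


def points_to_mask(t: int, pos: List[Point], neg: List[Point],
                   prev: Optional[Mask]) -> Mask:
    mask = [[0] * 4 for _ in range(4)]
    for r, c in pos:
        mask[r][c] = 1   # mark each positive point as foreground
    return mask


first_mask = [[0, 0, 0, 0],
              [0, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 0]]
masks = propagate(first_mask, 10, identity_tracker, points_to_mask)
```

In a real setup the two stand-ins would be replaced by calls into an actual tracker and SAM; the loop structure (track, prompt twice, periodically re-sample) is the part this sketch is meant to show.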

Experimental results

Research questions

  • RQ1: Can sparse point propagation combined with SAM achieve competitive zero-shot video segmentation without any video segmentation training data?
  • RQ2: How do different point sampling strategies and trackers affect zero-shot VOS performance across standard benchmarks?
  • RQ3: Does a two-pass SAM prompting scheme with positive and negative points improve mask quality in video frames?
  • RQ4: What is the impact of reinitializing points on long sequences and challenging scenarios like occlusions?

Key findings

  • SAM-PT achieves state-of-the-art zero-shot VOS performance on DAVIS 2017 (J&F = 79.4) and DAVIS 2016 (84.3).
  • On YouTube-VOS 2018, SAM-PT attains the highest zero-shot score among methods, with J&F = 76.2.
  • SAM-PT outperforms several zero-shot baselines and even some fully supervised VIS methods on UVO.
  • Ablations show eight positive points per object substantially boost performance (vs. one), and adding negative points plus iterative refinement improves results further.
  • Reinitializing points every 8 frames and sampling from updated masks helps recover from tracker errors and occlusions, improving robustness across datasets.
  • SAM-PT demonstrates strong cross-dataset generalization, performing well on DAVIS, YouTube-VOS, MOSE, and BDD100K in zero-shot or interactive settings.
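For readers unfamiliar with the J&F score quoted above: it is the mean of region similarity J (the intersection-over-union of predicted and ground-truth masks) and contour accuracy F (a boundary F-measure). The sketch below computes J for one frame and averages it with an F value that is assumed to be computed elsewhere; the function names are illustrative and not taken from any benchmark toolkit.

```python
from typing import List

Mask = List[List[int]]  # binary mask, 1 = object pixel


def region_similarity(pred: Mask, gt: Mask) -> float:
    """Jaccard index J: |pred AND gt| / |pred OR gt| for one frame."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j: float, f: float) -> float:
    """J&F is the plain average of the two scores."""
    return (j + f) / 2.0


pred = [[1, 1],
        [0, 0]]
gt = [[1, 0],
      [0, 0]]
j = region_similarity(pred, gt)  # 1 shared pixel out of 2 in the union
```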

This review was created by AI and reviewed by human editors.