[Paper Review] Segment Anything Meets Point Tracking
SAM-PT combines the Segment Anything Model (SAM) with long-term point tracking to enable zero-shot interactive video segmentation from sparse query points, achieving strong results on multiple VOS/VIS benchmarks without using any video segmentation data during training.
The Segment Anything Model (SAM) has established itself as a powerful zero-shot image segmentation model, enabled by efficient point-centric annotation and prompt-based models. While click and brush interactions are both well explored in interactive image segmentation, the existing methods on videos focus on mask annotation and propagation. This paper presents SAM-PT, a novel method for point-centric interactive video segmentation, empowered by SAM and long-term point tracking. SAM-PT leverages robust and sparse point selection and propagation techniques for mask generation. Compared to traditional object-centric mask propagation strategies, we uniquely use point propagation to exploit local structure information agnostic to object semantics. We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark. Our experiments on popular video object segmentation and multi-object segmentation tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and efficient interactions. We release our code that integrates different point trackers and video segmentation benchmarks at https://github.com/SysCV/sam-pt.
Motivation & Objective
- Motivate zero-shot interactive video segmentation by leveraging a foundation image segmentation model (SAM) and sparse point prompts.
- Develop a point-centric propagation framework that tracks query points through video frames to guide segmentation.
- Enable mask refinement and occasional reinitialization to maintain accuracy over long video sequences.
- Evaluate SAM-PT across semi-supervised, open-world, fully interactive VOS, and VIS settings on diverse benchmarks.
- Highlight practical interactive annotation benefits and zero-shot generalization without video segmentation training data.
Proposed method
- Extend SAM with long-term point trackers (e.g., PIPS, CoTracker) to propagate positive and negative query points across frames.
- Sample initial positive/negative points from the first frame using K-Medoids, Shi-Tomasi, random, or mixed sampling; the ablations favor eight positive points per object.
- Prompt SAM in two passes per frame: first with only positive points to localize the object, then with both positive and negative points plus the previous mask for refinement.
- Reinitialize query points at every prediction horizon (h = 8 frames) by sampling new points from the latest predicted mask, which helps recover from tracking errors and occlusions.
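The sampling and reinitialization steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `k_medoids`, `sample_query_points`, and `propagate` are hypothetical names, negative points are simply drawn at random from the background, and `track` and `segment` are placeholder callables standing in for a point tracker (such as PIPS or CoTracker) and for the two-pass SAM prompting.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two (x, y) points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def k_medoids(points, k, iters=10, seed=0):
    """Plain alternating k-medoids; returns k points that are members of `points`."""
    if k <= 0:
        return []
    rng = random.Random(seed)
    medoids = rng.sample(points, k)
    for _ in range(iters):
        # Assign every point to its nearest medoid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, medoids[i]))
            clusters[nearest].append(p)
        # Move each medoid to the cluster member with the smallest total distance.
        updated = [
            min(c, key=lambda q: sum(dist2(q, p) for p in c)) if c else m
            for c, m in zip(clusters, medoids)
        ]
        if updated == medoids:
            break
        medoids = updated
    return medoids

def sample_query_points(mask, n_pos=8, n_neg=1, seed=0):
    """mask: 2D list of 0/1 values. Returns (positive, negative) lists of (x, y)."""
    fg = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    bg = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if not v]
    pos = k_medoids(fg, min(n_pos, len(fg)), seed=seed)
    # Assumption: negatives are uniform random background pixels.
    neg = random.Random(seed).sample(bg, min(n_neg, len(bg)))
    return pos, neg

def propagate(first_mask, num_frames, track, segment, horizon=8):
    """Propagate a first-frame mask through a video with periodic reinitialization.

    track(points, src, dst) -> point locations in frame dst   (hypothetical tracker)
    segment(pos, neg, t)    -> 2D 0/1 mask for frame t        (hypothetical SAM wrapper)
    """
    masks = [first_mask]
    pos, neg = sample_query_points(first_mask)
    query_frame = 0
    for t in range(1, num_frames):
        masks.append(segment(track(pos, query_frame, t), track(neg, query_frame, t), t))
        if t % horizon == 0:
            # Resample query points from the latest predicted mask.
            pos, neg = sample_query_points(masks[-1])
            query_frame = t
    return masks
```

With `horizon=8`, the points sampled on frame 0 are tracked through frame 8, after which fresh points are drawn from the frame-8 prediction and tracked from there. The fixed schedule is a simplification chosen for the sketch.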
Experimental results
Research questions
- RQ1: Can sparse point propagation combined with SAM achieve competitive zero-shot video segmentation without any video segmentation training data?
- RQ2: How do different point sampling strategies and trackers affect zero-shot VOS performance across standard benchmarks?
- RQ3: Does a two-pass SAM prompting scheme with positive and negative points improve mask quality in video frames?
- RQ4: What is the impact of reinitializing points on long sequences and challenging scenarios like occlusions?
Key findings
- SAM-PT achieves state-of-the-art zero-shot VOS performance on DAVIS 2017 (J&F = 79.4) and DAVIS 2016 (84.3).
- On YouTube-VOS 2018, SAM-PT attains the highest zero-shot score among the compared methods, with J&F = 76.2.
- SAM-PT outperforms several zero-shot baselines and even some fully supervised VIS methods on UVO.
- Ablations show eight positive points per object substantially boost performance (vs. one), and adding negative points plus iterative refinement improves results further.
- Reinitializing points every 8 frames and sampling from updated masks helps recover from tracker errors and occlusions, improving robustness across datasets.
- SAM-PT demonstrates strong cross-dataset generalization, performing well on DAVIS, YouTube-VOS, MOSE, and BDD100K in zero-shot or interactive settings.
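For readers unfamiliar with the J&F score quoted above: J is the region similarity (Jaccard index, or intersection over union, between predicted and ground-truth masks), F is a boundary accuracy F-measure, and J&F is their mean. The sketch below shows only the J term and the final averaging. The function names are hypothetical, and treating two empty masks as a perfect match is a convention assumed here.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two 2D 0/1 masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += int(bool(p) and bool(g))
            union += int(bool(p) or bool(g))
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union

def j_and_f(j, f):
    """J&F is the arithmetic mean of region similarity J and boundary accuracy F."""
    return (j + f) / 2

pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
print(jaccard(pred, gt))  # 2 overlapping pixels out of 4 in the union
```

Benchmark scores are reported as percentages averaged over objects and frames, so a single-frame value like this one is only a building block of the reported number.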
This review was created by AI and reviewed by human editors.