QUICK REVIEW

[Paper Review] Segment Anything Meets Point Tracking

Frano Rajič, Lei Ke | arXiv (Cornell University) | Jul 3, 2023

Visual Attention and Saliency Detection | 38 citations

TL;DR

SAM-PT combines Segment Anything Model (SAM) with long-term point tracking to enable zero-shot interactive video segmentation using sparse query points, achieving strong results on multiple VOS/VIS benchmarks without video data during training.

ABSTRACT

The Segment Anything Model (SAM) has established itself as a powerful zero-shot image segmentation model, enabled by efficient point-centric annotation and prompt-based models. While click and brush interactions are both well explored in interactive image segmentation, the existing methods on videos focus on mask annotation and propagation. This paper presents SAM-PT, a novel method for point-centric interactive video segmentation, empowered by SAM and long-term point tracking. SAM-PT leverages robust and sparse point selection and propagation techniques for mask generation. Compared to traditional object-centric mask propagation strategies, we uniquely use point propagation to exploit local structure information agnostic to object semantics. We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark. Our experiments on popular video object segmentation and multi-object segmentation tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and efficient interactions. We release our code that integrates different point trackers and video segmentation benchmarks at https://github.com/SysCV/sam-pt.

Motivation & Objective

Enable zero-shot interactive video segmentation by combining a foundation image segmentation model (SAM) with sparse point prompts.
Develop a point-centric propagation framework that tracks query points through video frames to guide segmentation.
Enable mask refinement and occasional reinitialization to maintain accuracy over long video sequences.
Evaluate SAM-PT across semi-supervised, open-world, fully interactive VOS, and VIS settings on diverse benchmarks.
Highlight practical interactive annotation benefits and zero-shot generalization without video training data.

Proposed method

Extend SAM with long-term point trackers (e.g., PIPS, CoTracker) to propagate positive and negative query points across frames.
Sample initial positive and negative points from the first frame using K-Medoids, Shi-Tomasi, random, or mixed sampling; the ablations favor eight positive points per object.
Prompt SAM in two passes per frame: first with only positive points to localize the object, then with both positive and negative points plus the previous mask for refinement.
Reinitialize query points once per prediction horizon (h = 8 frames) by sampling new points from the latest predicted mask to recover from tracking errors and occlusions.
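The steps above can be sketched as a short control loop. This is a minimal illustration and not the authors' implementation: `sample_points`, `sam_pt_sketch`, and the `track` and `segment` callables are hypothetical names standing in for a point tracker (such as PIPS or CoTracker) and for SAM. The sampler simply strides over mask pixels as a placeholder for K-Medoids or Shi-Tomasi sampling, the "previous mask" in the second pass is read here as the mask from the first pass, and occlusion handling is omitted.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

Point = Tuple[int, int]      # (x, y) pixel coordinate
Mask = List[List[bool]]      # H x W boolean mask
# Hypothetical interfaces (assumptions, not a real library API):
# track(window_frames, points) -> one list of points per frame in the window
# segment(frame, positive, negative, prior_mask) -> predicted mask
Tracker = Callable[[Sequence[Any], List[Point]], List[List[Point]]]
Segmenter = Callable[[Any, List[Point], List[Point], Optional[Mask]], Mask]


def sample_points(mask: Mask, k: int, inside: bool) -> List[Point]:
    """Pick up to k pixels inside (or outside) the mask by striding over them.

    Placeholder for the K-Medoids / Shi-Tomasi / random sampling options.
    """
    pixels = [(x, y) for y, row in enumerate(mask)
              for x, value in enumerate(row) if value == inside]
    if not pixels or k <= 0:
        return []
    step = max(1, len(pixels) // k)
    return pixels[::step][:k]


def sam_pt_sketch(frames: Sequence[Any], first_mask: Mask,
                  track: Tracker, segment: Segmenter,
                  n_pos: int = 8, n_neg: int = 1,
                  horizon: int = 8) -> List[Mask]:
    """Propagate a first-frame mask through the video, one mask per frame."""
    masks = [first_mask]
    start = 0
    while start < len(frames) - 1:
        end = min(start + horizon, len(frames) - 1)
        # Reinitialization: sample fresh query points from the latest mask.
        pos = sample_points(masks[start], n_pos, inside=True)
        neg = sample_points(masks[start], n_neg, inside=False)
        window = frames[start:end + 1]
        pos_tracks = track(window, pos)
        neg_tracks = track(window, neg)
        for i in range(1, len(window)):
            frame = window[i]
            # Pass 1: positive points only, to localize the object.
            coarse = segment(frame, pos_tracks[i], [], None)
            # Pass 2: positive and negative points plus the pass-1 mask.
            refined = segment(frame, pos_tracks[i], neg_tracks[i], coarse)
            masks.append(refined)
        start = end
    return masks
```

With a 20-frame clip and the default horizon of 8, this loop re-samples points at frames 0, 8, and 16 and calls the segmenter twice for each of the 19 propagated frames.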

Experimental results

Research questions

RQ1: Can sparse point propagation combined with SAM achieve competitive zero-shot video segmentation without any video segmentation training data?
RQ2: How do different point sampling strategies and trackers affect zero-shot VOS performance across standard benchmarks?
RQ3: Does a two-pass SAM prompting scheme with positive and negative points improve mask quality in video frames?
RQ4: What is the impact of reinitializing points on long sequences and challenging scenarios like occlusions?

Key findings

SAM-PT achieves state-of-the-art zero-shot VOS performance on DAVIS 2017 (J&F = 79.4) and DAVIS 2016 (84.3).
On YouTube-VOS 2018, SAM-PT attains the highest score among the zero-shot methods compared, with J&F = 76.2.
SAM-PT outperforms several zero-shot baselines and even some fully supervised VIS methods on UVO.
Ablations show eight positive points per object substantially boost performance (vs. one), and adding negative points plus iterative refinement improves results further.
Reinitializing points every 8 frames and sampling from updated masks helps recover from tracker errors and occlusions, improving robustness across datasets.
SAM-PT demonstrates strong cross-dataset generalization, performing well on DAVIS, YouTube-VOS, MOSE, and BDD100K in zero-shot or interactive settings.
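For readers unfamiliar with the metric in these findings: J&F is the mean of region similarity J (the intersection over union of predicted and ground-truth masks) and contour accuracy F (a boundary F-measure), usually reported on a 0 to 100 scale. The sketch below computes J for a single frame and averages it with an F value supplied by the caller. The function names are illustrative, the boundary F-measure itself is not implemented, and scoring two empty masks as 1.0 is an assumed convention.

```python
from typing import List

Mask = List[List[bool]]  # H x W boolean mask


def region_similarity(pred: Mask, gt: Mask) -> float:
    """J: intersection over union of two boolean masks of equal shape."""
    inter = sum(p and g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p or g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j: float, f: float) -> float:
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2.0
```

Benchmark scores such as those quoted above are averages of these per-frame values over all objects and sequences in a dataset.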

This review was created by AI and reviewed by human editors.