QUICK REVIEW

[Paper Review] Segment and Track Anything

Yangming Cheng, Liulei Li | arXiv (Cornell University) | May 11, 2023

AI in Service Interactions | Cited by 79

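One-Sentence Summary

SAM-Track combines SAM, DeAOT, and Grounding-DINO to unify segmentation and tracking in video, supporting both interactive multimodal tracking and automatic tracking of multiple objects across frames.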

ABSTRACT

This report presents a framework called Segment And Track Anything (SAM-Track) that allows users to precisely and effectively segment and track any object in a video. Additionally, SAM-Track employs multimodal interaction methods that enable users to select multiple objects in videos for tracking, corresponding to their specific requirements. These interaction methods comprise click, stroke, and text, each possessing unique benefits and capable of being employed in combination. As a result, SAM-Track can be used across an array of fields, including drone technology, autonomous driving, medical imaging, augmented reality, and biological analysis. SAM-Track amalgamates Segment Anything Model (SAM), an interactive key-frame segmentation model, with our proposed AOT-based tracking model (DeAOT), which secured 1st place in four tracks of the VOT 2022 challenge, to facilitate object tracking in video. In addition, SAM-Track incorporates Grounding-DINO, which enables the framework to support text-based interaction. We have demonstrated the remarkable capabilities of SAM-Track on DAVIS-2016 Val (92.0%) and DAVIS-2017 Test (79.2%), and its practicability in diverse applications. The project page is available at: https://github.com/z-x-yang/Segment-and-Track-Anything.

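Motivation and Goals

Provide a unified video segmentation framework that handles multiple interaction modes and tracks objects across frames.
Use SAM for interactive key-frame segmentation and DeAOT for fast multi-object tracking.
Incorporate Grounding-DINO for natural-language object selection and open-set detection.
Offer two tracking modes (interactive and automatic) plus a fusion mode to suit flexible real-world use cases.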

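Proposed Method

Integrate SAM with DeAOT to propagate segmentation masks and object IDs across frames in multi-object settings.
Use Grounding-DINO to turn language prompts into bounding boxes, which then prompt segmentation.
Introduce Segment Everything and Object of Interest Segmentation to initialize automatic mode and detect new objects.
Define the Comparing Mask Results (CMR) mechanism to identify genuinely new objects and avoid ID conflicts during tracking.
Provide a fusion tracking mode that combines the interactive and automatic tracking modes.
Evaluate on DAVIS-2016-Val and DAVIS-2017-Test, with quantitative comparison against state-of-the-art methods.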
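
The summary names the CMR mechanism but does not spell out its rule. The following is a minimal sketch of one plausible reading, not the authors' implementation: a freshly detected region counts as a new object only if most of it falls outside every tracked mask, and it then receives an unused ID. The function names, the `min_area` and `min_uncovered` thresholds, and the `segment_fn` / `track_fn` callables (stand-ins for SAM and DeAOT) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

import numpy as np

# A label map is a 2-D integer array: 0 = background, k > 0 = object ID.
LabelMap = np.ndarray


def compare_mask_results(
    tracked: LabelMap,
    detected: LabelMap,
    min_area: int = 4,            # assumed threshold: ignore tiny regions
    min_uncovered: float = 0.5,   # assumed threshold: share outside tracked masks
) -> Tuple[LabelMap, List[int]]:
    """Merge newly detected regions into the tracked label map.

    A detected region is treated as new only when at least `min_uncovered`
    of its pixels are background in `tracked`. New regions get fresh IDs,
    so IDs already in use are never overwritten.
    """
    merged = tracked.copy()
    next_id = int(tracked.max()) + 1
    new_ids: List[int] = []
    for label in np.unique(detected):
        if label == 0:
            continue
        region = detected == label
        area = int(region.sum())
        if area < min_area:
            continue
        uncovered = region & (tracked == 0)
        if uncovered.sum() / area >= min_uncovered:
            merged[uncovered] = next_id  # only fill pixels not already claimed
            new_ids.append(next_id)
            next_id += 1
    return merged, new_ids


def track_automatic(
    frames: Sequence[np.ndarray],
    segment_fn: Callable[[np.ndarray], LabelMap],          # hypothetical SAM stand-in
    track_fn: Callable[[np.ndarray, LabelMap], LabelMap],  # hypothetical DeAOT stand-in
    every_n: int = 5,                                      # assumed re-detection interval
) -> List[LabelMap]:
    """Segment the first frame, propagate masks, and re-detect every `every_n` frames."""
    masks = [segment_fn(frames[0])]
    for i in range(1, len(frames)):
        mask = track_fn(frames[i], masks[-1])
        if i % every_n == 0:
            mask, _ = compare_mask_results(mask, segment_fn(frames[i]))
        masks.append(mask)
    return masks
```

Because only uncovered pixels are written and new IDs start above the current maximum, a region that merely re-detects an already tracked object leaves the existing IDs untouched.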

Experimental Results

Research Questions

RQ1: Can SAM-Track track and segment any object in a video with high accuracy under interactive prompts?
RQ2: How does the system perform multi-object tracking with temporal coherence using DeAOT across frames?
RQ3: Can Grounding-DINO enable effective language-based object selection within this segmentation-tracking pipeline?
RQ4: How can automatic mode detect and incorporate new objects appearing later in a video without disturbing existing IDs?
RQ5: What are the comparative gains over existing VOS methods on standard benchmarks?

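Key Findings

Method	Init.	DAVIS-2016 Val Avg	DAVIS-2016 Val J	DAVIS-2016 Val F	DAVIS-2017 Test Avg	DAVIS-2017 Test J	DAVIS-2017 Test F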
CFBI	Mask	89.4	88.3	90.5	75.6	71.6	79.6
CFBI+	Mask	89.9	88.7	91.1	78.0	74.4	81.6
MiVOS	Scribble	91.0	89.6	92.4	78.6	74.9	82.2
STCN	Mask	91.6	90.8	92.5	76.1	72.7	79.6
R50-AOT-L	Mask	91.1	90.1	92.1	79.6	75.9	83.3
XMem	Mask	92.0	90.7	93.2	81.2	77.6	84.7
R50-DeAOT-L	Mask	92.3	90.5	94.0	80.7	76.9	84.5
SwinB-DeAOT-L	Mask	92.9	91.1	94.7	82.8	78.9	86.7
SAM-Track(Ours)	Click	92.0	90.3	93.6	79.2	75.3	83.1
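
For readers unfamiliar with the column names: in the DAVIS benchmarks, J is region similarity (the intersection-over-union of predicted and ground-truth masks) and F is contour accuracy. The Avg values in the table are consistent with the mean of J and F. The sketch below computes J for one pair of masks and that mean; the function names are illustrative, the empty-mask convention is an assumption, and F is omitted because it requires boundary matching.

```python
import numpy as np


def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """J: intersection-over-union of two boolean masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def j_and_f_mean(j: float, f: float) -> float:
    """Mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0


# Toy example: two 2x3 rectangles shifted by one column.
pred = np.zeros((4, 4), dtype=bool)
pred[0:2, 0:3] = True
gt = np.zeros((4, 4), dtype=bool)
gt[0:2, 1:4] = True
print(region_similarity(pred, gt))  # 4 shared pixels out of 8 in the union

# SAM-Track rows from the table: about 91.95 (reported 92.0) and about 79.2.
print(j_and_f_mean(90.3, 93.6), j_and_f_mean(75.3, 83.1))
```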

SAM-Track achieves strong performance on DAVIS-2016-Val with 92.0 average, 90.3 J, and 93.6 F using interactive clicks.
On DAVIS-2017-Test, SAM-Track records 79.2 average, 75.3 J, and 83.1 F with the same setup.
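With click initialization, SAM-Track outperforms CFBI, CFBI+, MiVOS, and STCN on both benchmarks and ties XMem on DAVIS-2016-Val (92.0), but it trails the mask-initialized R50-DeAOT-L (92.3 / 80.7) and SwinB-DeAOT-L (92.9 / 82.8) on both, and trails R50-AOT-L (79.6) and XMem (81.2) on DAVIS-2017-Test.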
Two flexible tracking modes (interactive and automatic) and a fusion mode enable versatile deployment across domains like sports analysis, medical imaging, and autonomous driving.
Grounding-DINO integration enables natural language prompts to guide object selection, broadening open-set detection capabilities.
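
One practical detail when chaining a text-prompted detector into a box-prompted segmenter is the box format. The sketch below assumes the detector returns boxes as normalized (cx, cy, w, h) and the segmenter expects pixel (x0, y0, x1, y1). Both formats are assumptions to check against the documentation of the versions in use, and `boxes_to_pixel_xyxy` is a hypothetical helper, not part of either library.

```python
import numpy as np


def boxes_to_pixel_xyxy(boxes_cxcywh, width: int, height: int) -> np.ndarray:
    """Convert normalized (cx, cy, w, h) boxes to pixel (x0, y0, x1, y1).

    Boxes are clipped to the image so a segmenter never receives
    coordinates outside the frame.
    """
    boxes = np.asarray(boxes_cxcywh, dtype=float).reshape(-1, 4)
    cx, cy, w, h = boxes.T
    xyxy = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
    xyxy *= np.array([width, height, width, height], dtype=float)
    xyxy[:, [0, 2]] = np.clip(xyxy[:, [0, 2]], 0, width)
    xyxy[:, [1, 3]] = np.clip(xyxy[:, [1, 3]], 0, height)
    return xyxy


# Usage: one centered box and one box that spills past the top-left corner.
boxes = [[0.5, 0.5, 0.5, 0.5], [0.1, 0.1, 0.4, 0.4]]
print(boxes_to_pixel_xyxy(boxes, width=640, height=480))
```

Each resulting row can then be passed as a box prompt for the corresponding detected object.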

This review was generated by AI and checked by human editors.