QUICK REVIEW

[Paper Digest] SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

Cheng-Yen Yang, Hsiang-Wei Huang | arXiv (Cornell University) | Nov 18, 2024

Video Surveillance and Tracking Methods | Cited by 9

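ONE-SENTENCE SUMMARY

SAMURAI builds on SAM 2, adding motion-aware memory selection and Kalman-filter-based motion modeling to achieve real-time zero-shot visual tracking without retraining, and it reaches state-of-the-art results on several benchmarks.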

ABSTRACT

The Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks but faces challenges in visual object tracking, particularly when managing crowded scenes with fast-moving or self-occluding objects. Furthermore, the fixed-window memory approach in the original model does not consider the quality of memories selected to condition the image features for the next frame, leading to error propagation in videos. This paper introduces SAMURAI, an enhanced adaptation of SAM 2 specifically designed for visual object tracking. By incorporating temporal motion cues with the proposed motion-aware memory selection mechanism, SAMURAI effectively predicts object motion and refines mask selection, achieving robust, accurate tracking without the need for retraining or fine-tuning. SAMURAI operates in real-time and demonstrates strong zero-shot performance across diverse benchmark datasets, showcasing its ability to generalize without fine-tuning. In evaluations, SAMURAI achieves significant improvements in success rate and precision over existing trackers, with a 7.1% AUC gain on LaSOT$_{\text{ext}}$ and a 3.5% AO gain on GOT-10k. Moreover, it achieves competitive results compared to fully supervised methods on LaSOT, underscoring its robustness in complex tracking scenarios and its potential for real-world applications in dynamic environments.

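MOTIVATION AND GOALS

Improve tracking accuracy on challenging videos by introducing motion cues into SAM 2.
Mitigate error propagation from memory by introducing a memory selection mechanism that favors relevant frames.
Demonstrate strong zero-shot generalization across diverse VOT benchmarks without fine-tuning.
Maintain real-time performance suitable for online tracking scenarios.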

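PROPOSED METHOD

Introduces a Kalman-filter-based motion model to refine bounding-box predictions, and selects the best mask by combining a KF-IoU score with the original mask similarity score.
Develops a motion-aware memory selection mechanism that builds the memory bank from past frames using a hybrid score of mask similarity, object occurrence, and motion cues.
Replaces the fixed-window memory with a selective memory bank to reduce memory-related error propagation during occlusion and deformation.
Integrates the proposed components into the existing SAM 2 architecture (memory attention, memory encoder, mask decoder) without retraining or fine-tuning.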
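
The mask selection and memory selection steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation. The Kalman filter itself is omitted: `kf_box` is assumed to be the box it predicts for the current frame. The weight `alpha`, the three thresholds, `max_frames`, and all names (`Candidate`, `FrameRecord`, `select_mask`, `select_memory`) are hypothetical. Requiring every score to clear its own threshold is only one simple way to combine the three cues into a hybrid criterion.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Candidate:
    box: Box           # bounding box derived from one candidate mask
    mask_score: float  # the model's own score for that mask


def select_mask(candidates: List[Candidate], kf_box: Box,
                alpha: float = 0.25) -> Tuple[int, float]:
    """Return (index, KF-IoU) of the candidate with the best combined score.

    Combined score = alpha * KF-IoU + (1 - alpha) * mask_score.
    `alpha` is an assumed weight, chosen only for illustration.
    """
    best_idx, best_score, best_kf = -1, float("-inf"), 0.0
    for i, cand in enumerate(candidates):
        kf_iou = iou(cand.box, kf_box)
        score = alpha * kf_iou + (1.0 - alpha) * cand.mask_score
        if score > best_score:
            best_idx, best_score, best_kf = i, score, kf_iou
    return best_idx, best_kf


@dataclass
class FrameRecord:
    index: int
    mask_score: float    # mask similarity score of the selected mask
    object_score: float  # object occurrence score (target visible or not)
    motion_score: float  # KF-IoU of the selected mask


def select_memory(history: List[FrameRecord], max_frames: int = 3,
                  tau_mask: float = 0.5, tau_obj: float = 0.0,
                  tau_motion: float = 0.5) -> List[int]:
    """Walk back from the newest frame and keep frames that pass all thresholds."""
    kept: List[int] = []
    for rec in reversed(history):
        if (rec.mask_score > tau_mask and rec.object_score > tau_obj
                and rec.motion_score > tau_motion):
            kept.append(rec.index)
            if len(kept) == max_frames:
                break
    return kept


# Usage: the second candidate has the higher mask score, but it lies far from
# the motion prediction, so the combined score picks candidate 0.
candidates = [Candidate((0, 0, 10, 10), 0.80), Candidate((20, 20, 30, 30), 0.90)]
chosen, kf_iou = select_mask(candidates, kf_box=(0, 0, 10, 10))
```

Unlike a fixed window of the most recent frames, `select_memory` skips frames whose mask was unreliable or whose target was judged absent, which is the intuition behind using a selective memory bank to limit error propagation.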

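EXPERIMENTAL RESULTS

Research Questions

RQ1: Does adding motion modeling improve SAM 2's mask prediction accuracy in visual object tracking?
RQ2: Can motion-aware memory selection reduce error propagation and identity switches in long sequences?
RQ3: How does SAMURAI's zero-shot tracking performance compare with baseline methods on LaSOT, LaSOT_ext, GOT-10k, and TrackingNet/NFS/OTB100?
RQ4: Can the method run real-time online inference without additional training?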

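Key Findings

SAMURAI achieves a 7.1% AUC gain on LaSOT_ext and a 3.5% AO gain on GOT-10k relative to prior baselines.
Zero-shot SAMURAI reaches state-of-the-art performance on the LaSOT, LaSOT_ext, and GOT-10k benchmarks without training or fine-tuning.
SAMURAI-L achieves results on LaSOT that are competitive with fully supervised methods, indicating strong generalization in complex scenes.
Ablation studies show that motion modeling and memory selection each contribute to the improvement, and that combining the two gives the best results.
On a high-end GPU, the runtime remains real-time with minimal overhead, indicating that the method is practical for online tracking.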
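
The AUC and AO figures quoted above are overlap-based metrics. As a rough sketch of how such numbers are commonly computed from per-frame IoU values, the snippet below assumes 21 evenly spaced overlap thresholds in [0, 1] for the success curve and takes AO as the plain mean IoU. The official benchmark toolkits may differ in details such as per-sequence averaging, and the function names here are hypothetical.

```python
from typing import List


def average_overlap(ious: List[float]) -> float:
    """AO: mean IoU between predicted and ground-truth boxes over all frames."""
    return sum(ious) / len(ious)


def success_auc(ious: List[float], num_thresholds: int = 21) -> float:
    """Area under the success curve.

    For each threshold t, the success rate is the fraction of frames whose
    IoU exceeds t; the AUC is approximated by the mean of these rates.
    """
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(1 for v in ious if v > t) / len(ious) for t in thresholds]
    return sum(rates) / len(rates)


# Usage with three made-up per-frame IoU values.
ious = [0.0, 0.5, 1.0]
ao = average_overlap(ious)   # 0.5
auc = success_auc(ious)      # 10/21, about 0.476
```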

This digest was generated by AI and reviewed by a human editor.