QUICK REVIEW

[Paper Review] SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

Cheng-Yen Yang, Hsiang-Wei Huang | arXiv (Cornell University) | 2024-11-18

Video Surveillance and Tracking Methods | Citations: 9

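One-Line Summary

SAMURAI extends SAM 2 with motion-aware memory and Kalman-filter-based motion modeling, enabling real-time zero-shot visual tracking without retraining and achieving state-of-the-art performance on several benchmarks.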

ABSTRACT

The Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks but faces challenges in visual object tracking, particularly when managing crowded scenes with fast-moving or self-occluding objects. Furthermore, the fixed-window memory approach in the original model does not consider the quality of memories selected to condition the image features for the next frame, leading to error propagation in videos. This paper introduces SAMURAI, an enhanced adaptation of SAM 2 specifically designed for visual object tracking. By incorporating temporal motion cues with the proposed motion-aware memory selection mechanism, SAMURAI effectively predicts object motion and refines mask selection, achieving robust, accurate tracking without the need for retraining or fine-tuning. SAMURAI operates in real-time and demonstrates strong zero-shot performance across diverse benchmark datasets, showcasing its ability to generalize without fine-tuning. In evaluations, SAMURAI achieves significant improvements in success rate and precision over existing trackers, with a 7.1% AUC gain on LaSOT$_{\text{ext}}$ and a 3.5% AO gain on GOT-10k. Moreover, it achieves competitive results compared to fully supervised methods on LaSOT, underscoring its robustness in complex tracking scenarios and its potential for real-world applications in dynamic environments.

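Motivation and Objectives

Introduce motion cues into SAM 2 to improve tracking accuracy on challenging videos
Introduce a memory selection mechanism that prioritizes relevant frames to mitigate error propagation through memory
Demonstrate strong zero-shot generalization across diverse VOT benchmarks without fine-tuning
Maintain real-time performance suitable for online tracking scenarios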

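Proposed Method

Introduce a Kalman filter (KF) based motion model that refines bounding-box predictions and selects the best mask using a KF-IoU score combined with the original mask affinity score
Develop a motion-aware memory selection mechanism that builds the memory bank from past frames using a hybrid score combining mask affinity, object occurrence, and motion cues
Replace the fixed-window memory with a selective memory bank to reduce memory-related error propagation during occlusion and deformation
Integrate the proposed components into the existing SAM 2 architecture (memory attention, memory encoder, mask decoder) without retraining or fine-tuning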
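
The two selection steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `FrameScores` record, the weight `alpha`, the thresholds `tau_mask`, `tau_obj`, `tau_kf`, and the `max_frames` limit are all hypothetical choices made for illustration, and the Kalman-filter prediction itself is assumed to be supplied as an input box rather than computed here.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_mask(candidate_boxes: Sequence[Box], affinities: Sequence[float],
                kf_box: Box, alpha: float = 0.25) -> Tuple[int, float]:
    """Pick the candidate with the best combined score.

    Assumed scoring: alpha * KF-IoU + (1 - alpha) * mask affinity.
    `alpha` is a hypothetical weight, not a value taken from the paper.
    Returns the winning index and its KF-IoU.
    """
    kf_ious = [box_iou(box, kf_box) for box in candidate_boxes]
    scores = [alpha * k + (1 - alpha) * s for k, s in zip(kf_ious, affinities)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, kf_ious[best]


@dataclass
class FrameScores:
    """Per-frame scores kept for memory selection (hypothetical record)."""
    index: int
    affinity: float    # mask affinity score
    objectness: float  # object occurrence score
    kf_iou: float      # motion cue: IoU with the Kalman-filter box


def select_memory_frames(history: Sequence[FrameScores], max_frames: int = 7,
                         tau_mask: float = 0.5, tau_obj: float = 0.0,
                         tau_kf: float = 0.5) -> List[int]:
    """Walk back from the newest frame and keep frames passing all thresholds.

    All thresholds and `max_frames` are assumed values for illustration.
    Returns the selected frame indices in chronological order.
    """
    chosen: List[int] = []
    for frame in reversed(history):
        if (frame.affinity > tau_mask and frame.objectness > tau_obj
                and frame.kf_iou > tau_kf):
            chosen.append(frame.index)
            if len(chosen) == max_frames:
                break
    return chosen[::-1]


# Usage: the second candidate overlaps the predicted box, so it wins even
# though its affinity is slightly lower.
kf_box = (10.0, 10.0, 50.0, 50.0)
candidates = [(100.0, 100.0, 140.0, 140.0), (12.0, 11.0, 52.0, 49.0)]
best, kf_iou = select_mask(candidates, [0.80, 0.75], kf_box)

history = [
    FrameScores(0, 0.9, 1.0, 0.9),
    FrameScores(1, 0.3, 1.0, 0.9),   # low mask affinity, skipped
    FrameScores(2, 0.9, -1.0, 0.9),  # object likely absent, skipped
    FrameScores(3, 0.8, 2.0, 0.7),
]
memory = select_memory_frames(history, max_frames=2)
```

The point of the sketch is the design choice, not the numbers: a fixed window would have kept frames 2 and 3 as the two most recent, while the selective bank reaches back to frame 0 and skips the low-quality frames in between.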

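Experimental Results

Research Questions

RQ1: Does adding motion modeling improve SAM 2's mask prediction accuracy for visual object tracking?
RQ2: Can motion-aware memory selection reduce error propagation and identity switches in long sequences?
RQ3: How does zero-shot SAMURAI perform against baselines on LaSOT, LaSOT_ext, GOT-10k, and TrackingNet/NFS/OTB100?
RQ4: Is real-time online inference possible without additional training?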

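Key Results

SAMURAI achieves a 7.1% AUC gain on LaSOT_ext and a 3.5% AO gain on GOT-10k over previous baselines
Zero-shot SAMURAI achieves state-of-the-art performance on the LaSOT, LaSOT_ext, and GOT-10k benchmarks without training or fine-tuning
SAMURAI-L achieves results competitive with fully supervised methods on LaSOT, showing strong generalization in complex scenes
Both motion modeling and memory selection contribute to the performance gains, and their combination yields the best results
Runtime on a high-end GPU remains real-time with negligible overhead, making it practical for online tracking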
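
For readers unfamiliar with the two metrics quoted above, the following Python sketch shows how they are commonly computed from per-frame IoU values: AO as the mean overlap, and AUC as the area under the success curve (the fraction of frames whose IoU exceeds a threshold, averaged over thresholds). The function names and the 21-point threshold grid are assumptions for illustration; the official benchmark toolkits aggregate over sequences and may differ in detail.

```python
from typing import Sequence


def average_overlap(ious: Sequence[float]) -> float:
    """AO: mean IoU between predicted and ground-truth boxes over frames."""
    return sum(ious) / len(ious)


def success_auc(ious: Sequence[float], num_thresholds: int = 21) -> float:
    """Area under the success curve, approximated as the mean success rate.

    The success rate at threshold t is the fraction of frames with IoU > t.
    The evenly spaced grid on [0, 1] is an assumed choice.
    """
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(iou > t for iou in ious) / len(ious) for t in thresholds]
    return sum(rates) / len(rates)


# Usage on a toy three-frame sequence (hypothetical IoU values).
toy_ious = [0.0, 0.5, 1.0]
ao = average_overlap(toy_ious)
auc = success_auc(toy_ious)
```

Both values lie in [0, 1]; reviews such as this one usually quote them as percentages.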

This review was generated by AI and reviewed by a human editor.