QUICK REVIEW

[Paper Review] SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

Cheng-Yen Yang, Hsiang-Wei Huang|arXiv (Cornell University)|Nov 18, 2024

Video Surveillance and Tracking Methods | Cited by 9

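One-line summary

SAMURAI extends SAM 2 with motion-aware memory selection and Kalman-filter-based motion modeling, enabling real-time, zero-shot visual tracking without retraining and achieving state-of-the-art results on several benchmarks.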

ABSTRACT

The Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks but faces challenges in visual object tracking, particularly when managing crowded scenes with fast-moving or self-occluding objects. Furthermore, the fixed-window memory approach in the original model does not consider the quality of memories selected to condition the image features for the next frame, leading to error propagation in videos. This paper introduces SAMURAI, an enhanced adaptation of SAM 2 specifically designed for visual object tracking. By incorporating temporal motion cues with the proposed motion-aware memory selection mechanism, SAMURAI effectively predicts object motion and refines mask selection, achieving robust, accurate tracking without the need for retraining or fine-tuning. SAMURAI operates in real-time and demonstrates strong zero-shot performance across diverse benchmark datasets, showcasing its ability to generalize without fine-tuning. In evaluations, SAMURAI achieves significant improvements in success rate and precision over existing trackers, with a 7.1% AUC gain on LaSOT$_{\text{ext}}$ and a 3.5% AO gain on GOT-10k. Moreover, it achieves competitive results compared to fully supervised methods on LaSOT, underscoring its robustness in complex tracking scenarios and its potential for real-world applications in dynamic environments.

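Motivation and objectives

Improve tracking accuracy on challenging videos by incorporating motion cues into SAM 2.
Mitigate error propagation from memory by introducing a memory selection mechanism that prioritizes relevant frames.
Demonstrate strong zero-shot generalization across diverse VOT benchmarks without fine-tuning.
Maintain real-time performance suitable for online tracking scenarios.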

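Proposed method

Incorporates a Kalman-filter-based motion model to refine bounding-box predictions, and selects the best mask by combining a KF-IoU score with the original mask affinity score.
Develops a motion-aware memory selection mechanism that builds the memory bank from past frames using a hybrid score combining mask affinity, object occurrence, and motion cues.
Replaces the fixed-window memory with a selective memory bank to limit memory-related error propagation during occlusion and deformation.
Integrates the proposed components into the existing SAM 2 architecture (memory attention, memory encoder, mask decoder) without retraining or fine-tuning.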
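
The two selection steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the Kalman filter is assumed to have already produced a predicted box, the weight `alpha` and all thresholds are placeholder values, and the names `Candidate`, `FrameRecord`, `select_mask`, and `select_memory` are hypothetical. The memory step is written as one possible reading of the hybrid score, namely requiring each of the three scores to pass its own threshold.

```python
from dataclasses import dataclass

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Candidate:
    box: Box         # bounding box derived from one candidate mask
    affinity: float  # mask affinity score for that mask, in [0, 1]


def select_mask(candidates, predicted_box, alpha=0.25):
    """Pick the candidate with the best weighted sum of KF-IoU and affinity.

    predicted_box stands in for the Kalman filter prediction; alpha is an
    illustrative weight, not a value taken from the paper.
    """
    scores = [
        alpha * iou(c.box, predicted_box) + (1.0 - alpha) * c.affinity
        for c in candidates
    ]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores[best]


@dataclass
class FrameRecord:
    index: int
    affinity: float    # mask affinity score of the chosen mask
    occurrence: float  # object occurrence score (low when the target is hidden)
    motion: float      # KF-IoU of the chosen mask


def select_memory(history, max_frames=3, t_aff=0.5, t_occ=0.0, t_motion=0.5):
    """Walk back from the newest frame and keep frames passing all thresholds.

    All threshold values here are assumptions chosen for the example.
    """
    chosen = []
    for rec in reversed(history):
        if rec.affinity > t_aff and rec.occurrence > t_occ and rec.motion > t_motion:
            chosen.append(rec.index)
            if len(chosen) == max_frames:
                break
    return chosen


candidates = [
    Candidate(box=(0, 0, 10, 10), affinity=0.80),    # near the predicted position
    Candidate(box=(20, 20, 30, 30), affinity=0.85),  # distractor, higher affinity
]
best_index, best_score = select_mask(candidates, predicted_box=(1, 1, 11, 11))

history = [
    FrameRecord(0, 0.9, 1.0, 0.8),
    FrameRecord(1, 0.9, 1.0, 0.7),
    FrameRecord(2, 0.3, 1.0, 0.9),   # low-quality mask
    FrameRecord(3, 0.9, -1.0, 0.9),  # target judged absent
    FrameRecord(4, 0.8, 1.0, 0.6),
    FrameRecord(5, 0.9, 1.0, 0.2),   # inconsistent with predicted motion
]
memory_frames = select_memory(history)
```

In this toy setup the distractor has the higher affinity, yet the motion term tips the choice to the first candidate, and the memory bank skips frames 5, 3, and 2 in favor of older but cleaner frames. A fixed-window memory would instead have kept the three most recent frames regardless of their quality.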

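Experimental results

Research questions

RQ1: Does adding motion modeling to SAM 2 for visual object tracking improve mask prediction accuracy?
RQ2: Can motion-aware memory selection reduce error propagation and identity switches in long sequences?
RQ3: How does SAMURAI's zero-shot tracking performance compare with baselines on LaSOT, LaSOT_ext, GOT-10k, and TrackingNet/NFS/OTB100?
RQ4: Is real-time online inference possible without additional training?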

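Key findings

SAMURAI achieves a 7.1% AUC gain on LaSOT_ext and a 3.5% AO gain on GOT-10k over prior baselines.
Zero-shot SAMURAI achieves state-of-the-art performance on the LaSOT, LaSOT_ext, and GOT-10k benchmarks without training or fine-tuning.
SAMURAI-L achieves results competitive with fully supervised methods on LaSOT, showing strong generalization in complex scenes.
Ablations show that motion modeling and memory selection each contribute to the performance gain, and that combining them gives the best results.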
Runtime remains real-time with negligible overhead on a high-end GPU, indicating practicality for online tracking.
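
For readers unfamiliar with the AO and AUC figures quoted above, the sketch below shows how such metrics are commonly computed from per-frame IoU values. It assumes the usual conventions (AO as the mean IoU over frames, success AUC as the success rate averaged over 21 evenly spaced overlap thresholds); benchmark toolkits differ in details, so the function names and the threshold count are illustrative rather than taken from any specific evaluation code.

```python
def average_overlap(ious):
    """Mean IoU between predicted and ground-truth boxes over all frames."""
    return sum(ious) / len(ious)


def success_auc(ious, num_thresholds=21):
    """Area under the success plot, approximated as the mean success rate.

    The success rate at threshold t is the fraction of frames whose IoU
    exceeds t; thresholds run from 0 to 1 in equal steps (an assumed
    convention for this example).
    """
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(v > t for v in ious) / len(ious) for t in thresholds]
    return sum(rates) / len(rates)


# Three frames: a miss, a partial overlap, and a perfect overlap.
frame_ious = [0.0, 0.5, 1.0]
ao = average_overlap(frame_ious)
auc = success_auc(frame_ious)
```

Both metrics reward consistent overlap across a whole sequence, which is why a tracker that drifts after an occlusion loses ground on them even if its early frames are accurate.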


This review was generated by AI and checked by a human editor.