QUICK REVIEW

[Paper Review] Reading Relevant Feature from Global Representation Memory for Visual Object Tracking

Xinyu Zhou, Pinxue Guo|arXiv (Cornell University)|Feb 22, 2024

Advanced Image and Video Retrieval Techniques | Citations: 5

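One-Line Summary

This paper proposes RFGM, which uses a global representation memory and a relevance attention mechanism to read only the historical features most relevant to the current search region, improving adaptability and speed. It achieves competitive results on five benchmarks at about 71 FPS.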

ABSTRACT

Reference features from a template or historical frames are crucial for visual object tracking. Prior works utilize all features from a fixed template or memory for visual object tracking. However, due to the dynamic nature of videos, the required reference historical information for different search regions at different time steps is also inconsistent. Therefore, using all features in the template and memory can lead to redundancy and impair tracking performance. To alleviate this issue, we propose a novel tracking paradigm, consisting of a relevance attention mechanism and a global representation memory, which can adaptively assist the search region in selecting the most relevant historical information from reference features. Specifically, the proposed relevance attention mechanism in this work differs from previous approaches in that it can dynamically choose and build the optimal global representation memory for the current frame by accessing cross-frame information globally. Moreover, it can flexibly read the relevant historical information from the constructed memory to reduce redundancy and counteract the negative effects of harmful information. Extensive experiments validate the effectiveness of the proposed method, achieving competitive performance on five challenging datasets with 71 FPS.

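Motivation and Objectives

Motivated by the need for tracking that stays robust under appearance and background changes, avoid the redundancy caused by using every memory feature.
Propose a global representation memory (GR memory) that stores representative target features from across the whole video at the token level.
Develop a relevance attention mechanism that reads the historical tokens most relevant to the current frame and updates the GR memory accordingly.
Show that selective reading and token-level memory updates improve tracking accuracy and speed across benchmarks.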

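Proposed Method

Introduce a relevance attention mechanism that reads global cross-frame information and selects the best tokens for the current frame.
Build the global representation memory (GR memory) according to token relevance by selectively merging tokens from a new template into the existing memory.
Perform token-level updates through top-k token selection, guided by adaptive ranking and by differentiable selection with Gumbel-Softmax.
Use a ViT-based encoder with relevance attention layers at selected depths, and a three-branch decoder that predicts score, offset, and size.
Train with a combination of focal loss for scoring, L1 and GIoU losses for localization, and a ratio loss that regulates how many memory tokens are retained.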
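
The token selection steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the pooled query vector, the scaled dot-product scoring, and the temperature `tau` are all assumptions made for illustration, since this review does not give those details. The sketch scores each reference token against a query, keeps the top-k tokens as a hard selection, and shows the Gumbel-Softmax relaxation that is commonly used to make such a selection differentiable during training.

```python
import math
import random


def relevance_scores(query, tokens):
    """Scaled dot-product relevance of each token to a query vector.

    Assumption: a single pooled query vector stands in for the search region.
    """
    scale = 1.0 / math.sqrt(len(query))
    return [sum(q * t for q, t in zip(query, tok)) * scale for tok in tokens]


def top_k_indices(scores, k):
    """Indices of the k highest scores, highest first (hard selection)."""
    k = min(k, len(scores))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def gumbel_softmax(logits, tau=1.0, rng=None):
    """Soft selection weights: softmax over Gumbel-perturbed logits."""
    rng = rng or random.Random()
    perturbed = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 so both logs stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        perturbed.append((logit + gumbel) / tau)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(perturbed)
    exps = [math.exp(p - peak) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy data: three 2-D reference tokens and one query vector.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.2]

scores = relevance_scores(query, tokens)
selected = top_k_indices(scores, k=2)
soft_weights = gumbel_softmax(scores, tau=0.5, rng=random.Random(0))
```

In this toy setup, `selected` holds the indices of the two tokens that best match the query, while `soft_weights` is a noisy probability vector over all tokens. A lower `tau` pushes those weights closer to a one-hot choice.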

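Experimental Results

Research Questions

RQ1: In visual tracking, how can the historical features most relevant to a given search region be identified and read?
RQ2: Can a token-level global memory be maintained that captures long-term target appearance while avoiding memory crowding and error accumulation?
RQ3: Does relevance-based memory updating improve tracking robustness and speed compared with fixed-template memory strategies?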

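Key Findings

RFGM achieves competitive results on the TrackingNet, GOT-10k, LaSOT, OTB, and UAV123 benchmarks.
The model runs at 71 FPS, demonstrating high efficiency for real-time tracking.
GR memory stores representative target tokens from across the whole video and reduces error accumulation compared with fixed template updates.
Relevance attention outperforms standard attention by reading from memory selectively, which allows the memory to be reduced with almost no increase in parameters.
In ablations, GR memory with adaptive token ranking gives the best overall performance, and a memory size of about 192 tokens works best.
With relevance attention, MACs are reduced relative to plain attention while performance is maintained.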
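
The bounded, token-level memory described in these findings can be pictured with a small sketch. This is an illustrative assumption rather than the paper's update rule: the `(score, token)` pairs, the `update_memory` name, and the simple sort-and-truncate policy are hypothetical. Only the capacity of 192 tokens is taken from the ablation result reported above.

```python
def update_memory(memory, new_tokens, capacity=192):
    """Merge new (score, token) pairs into memory and keep the best ones.

    Assumption: each entry carries a precomputed relevance score, and the
    memory simply retains the `capacity` highest-scoring entries.
    """
    merged = list(memory) + list(new_tokens)
    # Sort by score only, so the token payloads are never compared.
    merged.sort(key=lambda pair: pair[0], reverse=True)
    return merged[:capacity]


# Hypothetical example with a tiny capacity so the effect is visible.
memory = [(0.9, "tok_a"), (0.2, "tok_b")]
incoming = [(0.5, "tok_c"), (0.1, "tok_d")]
memory = update_memory(memory, incoming, capacity=3)
```

After this update the memory keeps `tok_a`, `tok_c`, and `tok_b`, and the lowest-scoring candidate `tok_d` is dropped. A fixed capacity like this keeps the cost of reading from memory constant however long the video runs.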


This review was generated by AI and checked by a human editor.