QUICK REVIEW

[Paper Review] TrTr: Visual Tracking with Transformer

Moju Zhao, Kei Okada | arXiv (Cornell University) | May 9, 2021

Video Surveillance and Tracking Methods | References: 54 | Citations: 73

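One-Line Summary

TrTr replaces cross-correlation with a Transformer encoder-decoder architecture built on self-attention and cross-attention to capture global contextual dependencies, and adds an online update module to improve robustness to appearance changes.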

ABSTRACT

Template-based discriminative trackers are currently the dominant tracking methods due to their robustness and accuracy, and the Siamese-network-based methods that depend on cross-correlation operation between features extracted from template and search images show the state-of-the-art tracking performance. However, general cross-correlation operation can only obtain relationship between local patches in two feature maps. In this paper, we propose a novel tracker network based on a powerful attention mechanism called Transformer encoder-decoder architecture to gain global and rich contextual interdependencies. In this new architecture, features of the template image is processed by a self-attention module in the encoder part to learn strong context information, which is then sent to the decoder part to compute cross-attention with the search image features processed by another self-attention module. In addition, we design the classification and regression heads using the output of Transformer to localize target based on shape-agnostic anchor. We extensively evaluate our tracker TrTr, on VOT2018, VOT2019, OTB-100, UAV, NfS, TrackingNet, and LaSOT benchmarks and our method performs favorably against state-of-the-art algorithms. Training code and pretrained models are available at https://github.com/tongtybj/TrTr.

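Motivation and Objectives

Capture global context beyond local cross-correlation to improve tracking robustness and accuracy.
Present a Transformer-based architecture that performs both target classification and bounding-box regression for tracking.
Introduce an online update module that adapts to appearance changes during tracking.
Evaluate on major benchmarks to demonstrate competitive performance at real-time speed.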

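Proposed Method

Uses a Transformer encoder to process the template features with self-attention.
Uses a Transformer decoder that applies self-attention to the search features and then computes cross-attention against the encoded template features.
Replaces conventional cross-correlation with multi-head attention to model global relationships.
Applies classification and regression heads based on shape-agnostic anchors.
Introduces an online update branch that adapts the classification during tracking.
Trains end-to-end on large-scale video datasets with a focal loss for classification and an L1-based loss for regression.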
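
The attention flow described above can be sketched in a few lines. This is only an illustrative toy, not the authors' implementation: it assumes a single attention head, no learned query/key/value projections, and no positional encodings, residual connections, or feed-forward layers. The names `attention`, `template`, and `search` and all feature values are made up for the example.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out, weights = [], []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors, one output dimension at a time.
        out.append([sum(wi * v[d] for wi, v in zip(w, values))
                    for d in range(len(values[0]))])
    return out, weights

# Toy features (hypothetical): each row is one position of a flattened feature map.
template = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]             # 3 template positions
search = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0], [2.0, 1.0]]   # 4 search positions

# Encoder: self-attention over the template features.
template_ctx, _ = attention(template, template, template)
# Decoder, step 1: self-attention over the search features.
search_ctx, _ = attention(search, search, search)
# Decoder, step 2: cross-attention, where search positions query the encoded template.
fused, cross_weights = attention(search_ctx, template_ctx, template_ctx)
```

Each row of `cross_weights` assigns a nonzero weight to every template position, which is the sense in which attention relates the two feature maps globally rather than patch by patch.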

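Experimental Results

Research Questions

RQ1: Can Transformer-based attention improve tracking accuracy and robustness over cross-correlation by reasoning over a broader global context?
RQ2: Does the regression head based on shape-agnostic anchors improve localization under appearance changes and distractors?
RQ3: How does adding the online update module affect tracking performance and robustness?
RQ4: How does a reduced Transformer depth (1 encoder + 1 decoder) affect tracking performance and speed?
RQ5: Can the approach run in real time while remaining competitive with state-of-the-art Siamese trackers on the benchmarks?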
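
To make RQ2 more concrete, here is a minimal sketch of how a shape-agnostic head's outputs could be decoded into a box. The review does not describe the decoding step, so the layout is an assumption: one classification score per grid cell, a sub-cell offset, and a directly regressed box size, with no predefined anchor shapes. The function `decode_box`, the `stride` value, and all numbers are hypothetical.

```python
def decode_box(scores, offsets, sizes, stride):
    """Pick the highest-scoring cell and turn its regression outputs into a box.

    scores:  H x W grid of classification scores
    offsets: H x W grid of (dx, dy) sub-cell offsets in [0, 1)
    sizes:   H x W grid of (w, h) box sizes in pixels
    stride:  pixels per grid cell (assumed value)
    """
    cells = ((r, c) for r in range(len(scores)) for c in range(len(scores[0])))
    r, c = max(cells, key=lambda rc: scores[rc[0]][rc[1]])
    dx, dy = offsets[r][c]
    w, h = sizes[r][c]
    # Box centre in image pixels: cell index plus sub-cell offset, scaled by stride.
    cx = (c + dx) * stride
    cy = (r + dy) * stride
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# Toy 2 x 2 example with made-up values.
scores = [[0.1, 0.2], [0.9, 0.3]]
offsets = [[(0.5, 0.25)] * 2 for _ in range(2)]
sizes = [[(40.0, 20.0)] * 2 for _ in range(2)]
box = decode_box(scores, offsets, sizes, stride=8)
```

Because the width and height are regressed directly, the head never has to pick among a fixed set of anchor aspect ratios.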

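Key Results

TrTr-offline achieves strong accuracy and robustness on VOT2018/2019 and outperforms many Siamese-based trackers in accuracy.
Adding the online update module (TrTr-online) substantially improves EAO on the VOT benchmarks compared with the offline-only variant.
On OTB-100, TrTr-online achieves the highest AUC among the evaluated methods.
On UAV123 and NfS, TrTr-online ranks among the top methods, with a clear margin over several baselines.
On TrackingNet and LaSOT, TrTr is competitive but leaves room for improvement on large-scale datasets.
The model runs in real time: about 50 FPS offline and about 35 FPS with the online update integrated.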
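
As a rough reading of the speed figures, the approximate frame rates can be converted into per-frame time budgets. The helper name `frame_time_ms` is made up for this example, and since the FPS values are themselves approximate, the results are only indicative.

```python
def frame_time_ms(fps):
    # Time available per frame, in milliseconds, at a given frame rate.
    return 1000.0 / fps

offline_ms = frame_time_ms(50)           # offline tracker
online_ms = frame_time_ms(35)            # with the online update integrated
overhead_ms = online_ms - offline_ms     # implied extra cost per frame

print(f"offline: {offline_ms:.1f} ms, online: {online_ms:.1f} ms, "
      f"overhead: {overhead_ms:.1f} ms")
```

Under these figures, the online update would account for somewhat under 9 ms of additional processing per frame.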

This review was generated by AI and reviewed by a human editor.