QUICK REVIEW

[Paper Review] TransCenter: Transformers with Dense Queries for Multiple-Object Tracking

Yihong Xu, Yutong Ban | arXiv (Cornell University) | July 22, 2021

Video Surveillance and Tracking Methods | References: 69 | Citations: 75

One-Line Summary

TransCenter introduces a center-based transformer MOT framework that uses image-related dense detection queries and sparse tracking queries, and it achieves state-of-the-art results on MOT benchmarks.

ABSTRACT

Transformers have proven superior performance for a wide variety of tasks since they were introduced. In recent years, they have drawn attention from the vision community in tasks such as image classification and object detection. Despite this wave, an accurate and efficient multiple-object tracking (MOT) method based on transformers is yet to be designed. We argue that the direct application of a transformer architecture with quadratic complexity and insufficient noise-initialized sparse queries - is not optimal for MOT. We propose TransCenter, a transformer-based MOT architecture with dense representations for accurately tracking all the objects while keeping a reasonable runtime. Methodologically, we propose the use of image-related dense detection queries and efficient sparse tracking queries produced by our carefully designed query learning networks (QLN). On one hand, the dense image-related detection queries allow us to infer targets' locations globally and robustly through dense heatmap outputs. On the other hand, the set of sparse tracking queries efficiently interacts with image features in our TransCenter Decoder to associate object positions through time. As a result, TransCenter exhibits remarkable performance improvements and outperforms by a large margin the current state-of-the-art methods in two standard MOT benchmarks with two tracking settings (public/private). TransCenter is also proven efficient and accurate by an extensive ablation study and comparisons to more naive alternatives and concurrent works. For scientific interest, the code is made publicly available at https://github.com/yihongxu/transcenter.

Motivation and Objectives

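Motivate and design a transformer-based MOT method that avoids the coverage gaps left by sparse queries as well as over-detection in crowded scenes.
Introduce image-related dense detection queries to provide global, robust detection.
Develop sparse tracking queries and a dedicated decoder to associate objects across frames efficiently.
Reduce computational complexity so that MOT stays efficient even with dense queries.
Provide variants (TransCenter, TransCenter-Dual, TransCenter-Lite) that balance accuracy and efficiency.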

Proposed Method

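Use a weight-sharing transformer encoder (PVT-based) to produce multi-scale dense memories from consecutive frames.
Introduce Query Learning Networks (QLN) to generate dense detection queries and sparse tracking queries from the encoder memory.
Use the TransCenter Decoder, which consists of deformable cross-attention modules for tracking (TDCA) and for detection (DDCA).
Compute object displacements over time with the sparse tracking queries, using object positions from the previous frame.
The output branches predict the center heatmap, object sizes, and tracking displacements; the center heatmap is used for detection, and the tracking branch associates objects over time.
Train with a focal loss on the center heatmap, a sparse regression loss on object size, a tracking loss, and an overall weighted sum of these losses.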
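
The training objective in the last item can be sketched roughly as follows. This summary does not spell out the loss formulas, so the sketch assumes the penalty-reduced focal loss commonly used for center heatmaps in CenterNet-style detectors, a plain L1 loss for size and displacement, and made-up weights. All function names, default hyperparameters, and weights are hypothetical and are not taken from the TransCenter code.

```python
import math

def center_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced focal loss over a flattened heatmap (assumed form).

    pred   -- predicted center probabilities in [0, 1]
    target -- ground-truth heatmap values; exactly 1.0 at object centers
    """
    loss, num_pos = 0.0, 0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp so log() stays finite
        if y == 1.0:
            # Positive pixel: confident correct predictions contribute little.
            loss -= (1.0 - p) ** alpha * math.log(p)
            num_pos += 1
        else:
            # Negative pixel: the penalty shrinks as y approaches 1 (near a center).
            loss -= (1.0 - y) ** beta * p ** alpha * math.log(1.0 - p)
    return loss / max(num_pos, 1)  # normalize by the number of centers

def l1_loss(pred, target):
    """Mean absolute error over the values passed in.

    The caller is expected to pass only values sampled at object centers,
    which is what makes the regression sparse.
    """
    return sum(abs(p - t) for p, t in zip(pred, target)) / max(len(pred), 1)

def total_loss(l_heat, l_size, l_track, w_heat=1.0, w_size=0.1, w_track=0.5):
    """Weighted sum of the three terms; the default weights are placeholders."""
    return w_heat * l_heat + w_size * l_size + w_track * l_track
```

For example, `center_focal_loss([0.9, 0.1], [1.0, 0.0])` is much smaller than `center_focal_loss([0.1, 0.9], [1.0, 0.0])`, since the first prediction places high probability on the true center and low probability elsewhere.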

Experimental Results

Research Questions

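RQ1: Does a transformer-based MOT model with dense image-related detection queries outperform sparse-query, DETR-based MOT approaches?
RQ2: Does separating dense detection queries from sparse tracking queries improve detection robustness and tracking efficiency in crowded scenes?
RQ3: How do different QLN and decoder designs affect MOT accuracy and efficiency?
RQ4: How does the use of an efficient encoder (PVT) and deformable attention affect MOT runtime and performance?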

Key Results

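TransCenter reports new state-of-the-art MOT performance under the reported conditions, with gains of +4.0% MOTA on MOT17 and +18.8% MOTA on MOT20.
Dense image-related detection queries reduce missed detections and noise in crowded scenes compared with a fixed set of sparse queries.
Sparse tracking queries, built on previous-frame information, substantially speed up tracking attention while maintaining accuracy.
The TransCenter-Dual and TransCenter-Lite variants offer trade-offs between accuracy and computational efficiency.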
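
For readers unfamiliar with the metric quoted above, MOTA (Multiple Object Tracking Accuracy, from the CLEAR MOT metrics) combines false negatives, false positives, and identity switches relative to the number of ground-truth objects. The following is a minimal sketch of that standard definition; the function name and the example counts are illustrative only and are not results from the paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

# Illustrative counts only (not taken from the paper).
score = mota(false_negatives=100, false_positives=50, id_switches=10, num_gt=1000)
print(f"MOTA = {score:.1%}")
```

Because the error counts can exceed the number of ground-truth objects, MOTA can be negative; it is bounded above by 1 (100%).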

This review was generated by AI and reviewed by a human editor.