QUICK REVIEW

[Paper Review] TransCenter: Transformers with Dense Queries for Multiple-Object Tracking

Yihong Xu, Yutong Ban|arXiv (Cornell University)|Jul 22, 2021
Video Surveillance and Tracking Methods|References: 69|Cited by: 75
One-Sentence Summary

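TransCenter introduces a center-based transformer framework for MOT that uses image-related dense detection queries and sparse tracking queries, achieving state-of-the-art results on MOT benchmarks.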

ABSTRACT

Transformers have proven superior performance for a wide variety of tasks since they were introduced. In recent years, they have drawn attention from the vision community in tasks such as image classification and object detection. Despite this wave, an accurate and efficient multiple-object tracking (MOT) method based on transformers is yet to be designed. We argue that the direct application of a transformer architecture with quadratic complexity and insufficient noise-initialized sparse queries is not optimal for MOT. We propose TransCenter, a transformer-based MOT architecture with dense representations for accurately tracking all the objects while keeping a reasonable runtime. Methodologically, we propose the use of image-related dense detection queries and efficient sparse tracking queries produced by our carefully designed query learning networks (QLN). On one hand, the dense image-related detection queries allow us to infer targets' locations globally and robustly through dense heatmap outputs. On the other hand, the set of sparse tracking queries efficiently interacts with image features in our TransCenter Decoder to associate object positions through time. As a result, TransCenter exhibits remarkable performance improvements and outperforms by a large margin the current state-of-the-art methods in two standard MOT benchmarks with two tracking settings (public/private). TransCenter is also proven efficient and accurate by an extensive ablation study and comparisons to more naive alternatives and concurrent works. For scientific interest, the code is made publicly available at https://github.com/yihongxu/transcenter.

Research Motivation and Objectives

  • Motivate and design a transformer-based MOT method that avoids both the missed detections caused by insufficient sparse queries and the over-detections that arise in crowded scenes.
  • Introduce image-related dense detection queries to provide global, robust detections.
  • Develop sparse tracking queries and a specialized decoder to efficiently associate objects across frames.
  • Reduce computational complexity to achieve efficient MOT with dense queries.
  • Provide variants (TransCenter, TransCenter-Dual, TransCenter-Lite) to balance accuracy and efficiency.

Proposed Method

  • Use a weight-shared transformer encoder (PVT-based) to produce multi-scale dense memories from consecutive frames.
  • Introduce Query Learning Networks (QLN) to generate dense detection queries and sparse tracking queries from encoder memories.
  • Employ a TransCenter Decoder with Deformable Cross-Attention for both Tracking (TDCA) and Detection (DDCA).
  • Leverage sparse tracking queries with prior frame positions to compute object displacements through time.
  • Output branches compute center heatmaps, object sizes, and tracking displacements; use a center heatmap for detections and a tracking branch for temporally associating objects.
  • Train with a center heatmap focal loss, sparse regression loss for sizes, tracking loss, and an overall weighted loss.
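As a rough illustration of how a center heatmap and a size map can be turned into detections, the sketch below keeps pixels that are local maxima above a score threshold and reads the predicted size at each peak. This follows the general center-based detection idea only. The function name `decode_centers`, the 3x3 neighborhood, the array layouts, and the `threshold` value are assumptions made for this example and are not taken from the TransCenter code.

```python
import numpy as np

def decode_centers(heatmap, sizes, threshold=0.3):
    """Return (x, y, w, h, score) tuples for local heatmap peaks.

    heatmap: (H, W) array of center scores in [0, 1] (assumed layout).
    sizes:   (H, W, 2) array of per-pixel (width, height) predictions (assumed layout).
    """
    heatmap = np.asarray(heatmap, dtype=float)
    H, W = heatmap.shape
    # Pad with -inf so border pixels still have a full 3x3 neighborhood.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # Stack the nine shifted views and take the per-pixel neighborhood maximum.
    neighborhood = np.stack([
        padded[dy:dy + H, dx:dx + W]
        for dy in range(3)
        for dx in range(3)
    ])
    local_max = neighborhood.max(axis=0)
    # A pixel is a detection if it equals its neighborhood maximum and clears the threshold.
    is_peak = (heatmap >= local_max) & (heatmap > threshold)
    ys, xs = np.nonzero(is_peak)
    return [
        (int(x), int(y), float(sizes[y, x, 0]), float(sizes[y, x, 1]), float(heatmap[y, x]))
        for y, x in zip(ys, xs)
    ]
```

Because every pixel is a candidate, the number of detections is not capped by a fixed query count, which is the property the dense-query design relies on. In a tracker, the resulting centers would then be matched against track positions propagated by the predicted displacements; that association step is not shown here.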

Experimental Results

Research Questions

  • RQ1: Can a transformer-based MOT model with dense image-related detection queries outperform sparse-query DETR-based MOT approaches?
  • RQ2: Does separating dense detection queries from sparse tracking queries improve detection robustness and tracking efficiency in crowded scenes?
  • RQ3: How do different QLN and decoder designs affect MOT accuracy and efficiency?
  • RQ4: What is the impact of using efficient encoders (PVT) and deformable attention on MOT runtime and performance?

Key Findings

  • TransCenter sets new state-of-the-art MOT performance on MOT17 (+4.0% MOTA) and MOT20 (+18.8% MOTA) under their reported conditions.
  • Dense image-related detection queries reduce missed detections and noise compared with fixed sparse queries in crowded scenes.
  • Sparse tracking queries, driven by prior frame information, significantly speed up tracking attention without sacrificing accuracy.
  • Variants TransCenter-Dual and TransCenter-Lite provide trade-offs between accuracy and computational efficiency.
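For readers unfamiliar with the MOTA figures quoted above, the standard CLEAR MOT definition is MOTA = 1 - (FN + FP + IDSW) / GT, where FN, FP, and IDSW are the total false negatives, false positives, and identity switches over a sequence, and GT is the total number of ground-truth boxes. The helper below is a minimal sketch of that formula. The function name `mota` and the counts in the usage note are made up for illustration and are not results from the paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy from error counts summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    # MOTA can be negative when the tracker makes more errors than there are ground-truth boxes.
    return 1.0 - errors / num_gt
```

For example, `mota(10, 5, 5, 100)` gives about 0.8, that is, 80% MOTA for these hypothetical counts.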

This review was generated by AI and reviewed by human editors.