QUICK REVIEW

[Paper Review] OmniTracker: Unifying Object Tracking by Tracking-with-Detection

Junke Wang, Zuxuan Wu|arXiv (Cornell University)|Mar 21, 2023

Video Surveillance and Tracking Methods|Cited by 11

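One-Sentence Summary

OmniTracker is a unified model built on Deformable DETR that uses a tracking-with-detection paradigm to jointly handle instance tracking (SOT/VOS) and category tracking (MOT/MOTS/VIS), relying on Reference-guided Feature Enhancement and shared network weights.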

ABSTRACT

Visual Object Tracking (VOT) aims to estimate the positions of target objects in a video sequence, which is an important vision task with various real-world applications. Depending on whether the initial states of target objects are specified by provided annotations in the first frame or the categories, VOT could be classified as instance tracking (e.g., SOT and VOS) and category tracking (e.g., MOT, MOTS, and VIS) tasks. Different definitions have led to divergent solutions for these two types of tasks, resulting in redundant training expenses and parameter overhead. In this paper, combining the advantages of the best practices developed in both communities, we propose a novel tracking-with-detection paradigm, where tracking supplements appearance priors for detection and detection provides tracking with candidate bounding boxes for the association. Equipped with such a design, a unified tracking model, OmniTracker, is further presented to resolve all the tracking tasks with a fully shared network architecture, model weights, and inference pipeline, eliminating the need for task-specific architectures and reducing redundancy in model parameters. We conduct extensive experimentation on seven prominent tracking datasets of different tracking tasks, including LaSOT, TrackingNet, DAVIS16-17, MOT17, MOTS20, and YTVIS19, and demonstrate that OmniTracker achieves on-par or even better results than both task-specific and unified tracking models.

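Motivation and Objectives

Advance a unified tracking framework that covers both instance tracking and category tracking.
Propose a tracking-with-detection paradigm in which priors from the tracker enhance detection and detection boxes support tracking association.
Develop OmniTracker with a shared architecture, shared weights, and a shared inference pipeline so that one model can handle multiple tracking tasks.
Use memory-based identity embeddings and a contrastive ReID loss to associate objects robustly across frames.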
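To make the last objective concrete, the following is a minimal sketch of associating current-frame detections with a memory of identity embeddings. It is an illustration under stated assumptions, not the paper's implementation: the names `associate` and `update_memory`, the cosine-similarity score, the `min_sim` threshold, and the momentum update are all hypothetical. The optimal one-to-one assignment is found here by exhaustive search over permutations, which stands in for the Hungarian algorithm and is only practical for a handful of objects.

```python
import math
from itertools import permutations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-12)


def associate(memory, detections, min_sim=0.5):
    """Match detection embeddings to stored track embeddings.

    memory: dict mapping track id -> embedding (hypothetical memory bank).
    detections: list of embeddings for the current frame.
    Returns a dict mapping detection index -> track id.
    """
    track_ids = list(memory)
    sim = [[cosine(d, memory[t]) for t in track_ids] for d in detections]
    n_det, n_trk = len(detections), len(track_ids)

    # Enumerate every one-to-one assignment of the smaller set into the larger.
    if n_det <= n_trk:
        candidates = ([(d, t) for d, t in enumerate(p)]
                      for p in permutations(range(n_trk), n_det))
    else:
        candidates = ([(d, t) for t, d in enumerate(p)]
                      for p in permutations(range(n_det), n_trk))

    best_total, best_pairs = float("-inf"), []
    for pairs in candidates:
        total = sum(sim[d][t] for d, t in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs

    # Discard pairs whose similarity is too low to count as the same identity.
    return {d: track_ids[t] for d, t in best_pairs if sim[d][t] >= min_sim}


def update_memory(memory, detections, matches, momentum=0.9):
    """Blend matched embeddings into the memory; unmatched ones start new tracks."""
    for idx, emb in enumerate(detections):
        if idx in matches:
            old = memory[matches[idx]]
            memory[matches[idx]] = [momentum * o + (1 - momentum) * e
                                    for o, e in zip(old, emb)]
        else:
            memory[max(memory, default=0) + 1] = list(emb)
    return memory


memory = {1: [1.0, 0.0], 2: [0.0, 1.0]}
detections = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]
matches = associate(memory, detections)
update_memory(memory, detections, matches)
print(matches, sorted(memory))
```

In this toy run the first two detections are matched to the existing identities and the third, which resembles neither, opens a new track. A real tracker would replace the exhaustive search with a polynomial-time solver and learn the embeddings with the contrastive ReID loss.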

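Proposed Method

Introduce a Reference-guided Feature Enhancement (RFE) module that fuses appearance priors from the previous frame with current-frame features through cross-attention.
Feed the enhanced features into a Deformable DETR detector to predict bounding boxes and masks for every frame.
Use an identity-embedding memory bank with a contrastive ReID loss to learn object identities that remain stable across frames.
Compute a per-frame detection loss within the set-prediction framework, combining classification, box-regression, and mask terms.
Adopt a unified online tracking pipeline with Kalman-filter motion modeling and Hungarian data association across all tasks.
Train jointly on multiple tracking datasets (SOT, VOS, MOT, MOTS, VIS) together with COCO so that all tasks are optimized in a unified way.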
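The cross-attention step in the RFE module can be illustrated with a small sketch. This is a simplified, single-head version written under assumptions: the learned query, key, and value projections are omitted (treated as identity), the residual connection is a guess at how the prior is merged, and the function name `cross_attention` is hypothetical. Current-frame features act as queries; previous-frame (reference) features act as keys and values.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(current, reference):
    """Enhance current-frame features with reference-frame features.

    current:   list of N feature vectors (queries), each of length d.
    reference: list of M feature vectors (keys and values), each of length d.
    Returns (enhanced, weights), where weights[i][j] is how strongly
    current token i attends to reference token j.
    """
    d = len(current[0])
    scale = 1.0 / math.sqrt(d)
    enhanced, weights = [], []
    for q in current:
        # Scaled dot-product score against every reference token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in reference]
        w = softmax(scores)
        # Attention-weighted sum of the reference features.
        attended = [sum(wj * ref[c] for wj, ref in zip(w, reference))
                    for c in range(d)]
        # Residual connection (an assumption): keep the original feature
        # and add the appearance prior on top.
        enhanced.append([qi + ai for qi, ai in zip(q, attended)])
        weights.append(w)
    return enhanced, weights


current = [[1.0, 0.0], [0.0, 1.0]]
reference = [[2.0, 0.0], [0.0, 2.0]]
enhanced, weights = cross_attention(current, reference)
print(weights)
```

Each current-frame token attends most strongly to the reference token it resembles, so regions that look like the tracked target are amplified before the features reach the detector.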

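Experimental Results

Research Questions

RQ1: Can a single shared network architecture and training recipe effectively solve instance tracking and category tracking at the same time?
RQ2: Does Reference-guided Feature Enhancement (RFE) strengthen the detector's appearance priors for tracking and enable robust cross-frame association?
RQ3: How does joint training on diverse tracking tasks affect performance and generalization compared with task-specific or mixed training?
RQ4: What role do memory-based identity embeddings and the contrastive ReID loss play in keeping object identities consistent across frames?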

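Key Findings

OmniTracker achieves state-of-the-art or competitive results on seven tracking benchmarks, including LaSOT, TrackingNet, DAVIS16-17, MOT17, MOTS20, and YTVIS19.
The RFE module improves detection by supplying appearance priors; ablations show gains in P_norm on TrackingNet and in MOTA on MOT17.
Joint training across tasks yields consistent improvements over per-task training and the Unicorn baseline on multiple benchmarks, with notable gains on several of them.
OmniTracker maintains a fully shared pipeline for SOT, VOS, MOT, MOTS, and VIS, with FPS competitive with task-specific models.
On VOS, OmniTracker outperforms multi-task baselines and some unified models, showing strong per-frame and long-term association performance.
On VIS, OmniTracker-L is competitive with VIS-specific methods in mAP and related metrics.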
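For readers unfamiliar with the MOTA metric cited in the ablation, it follows the standard CLEAR MOT definition: one minus the ratio of all errors (false negatives, false positives, identity switches) to the number of ground-truth objects, summed over all frames. A small sketch, with a hypothetical function name and made-up counts that are not results from the paper:

```python
def mota(num_gt, false_negatives, false_positives, id_switches):
    """Multiple Object Tracking Accuracy (CLEAR MOT definition).

    All arguments are counts accumulated over every frame of a sequence.
    The value can be negative when errors outnumber ground-truth objects.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt


# Illustrative counts only: 160 errors over 1000 ground-truth objects,
# which gives a MOTA of roughly 0.84.
print(mota(num_gt=1000, false_negatives=100, false_positives=50, id_switches=10))
```

Because identity switches enter the numerator alongside detection errors, MOTA reflects both halves of the tracking-with-detection pipeline: better detection lowers false negatives and false positives, and better association lowers identity switches.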


This review was generated by AI and checked by a human editor.