[Paper Review] Type-to-Track: Retrieve Any Object via Prompt-based Tracking
The paper introduces Type-to-Track, a conversational, prompt-guided framework for grounded multiple object tracking, along with the GroOT dataset and the MENDER model, achieving state-of-the-art performance with higher efficiency. It formulates a single-stage, class-agnostic tracker that uses natural language prompts to retrieve and track objects in video sequences.
One of the recent trends in vision problems is to use natural language captions to describe the objects of interest. This approach can overcome some limitations of traditional methods that rely on bounding boxes or category annotations. This paper introduces a novel paradigm for Multiple Object Tracking called Type-to-Track, which allows users to track objects in videos by typing natural language descriptions. We present a new dataset for the Grounded Multiple Object Tracking task, called GroOT, that contains videos with various types of objects and their corresponding textual captions describing their appearance and action in detail. Additionally, we introduce two new evaluation protocols and formulate evaluation metrics specifically for this task. We develop a new efficient method that models a transformer-based eMbed-ENcoDE-extRact framework (MENDER) using the third-order tensor decomposition. The experiments in five scenarios show that our MENDER approach outperforms a two-stage design in both accuracy and efficiency, by up to 14.7% in accuracy and with up to 4$\times$ faster speed.
Motivation & Objective
- Motivate tracking by natural language prompts to improve intuitiveness and responsiveness over bounding-box or category-based methods.
- Create a large, diverse dataset (GroOT) with videos and rich textual descriptions to support grounded MOT.
- Develop an efficient transformer-based model (MENDER) that uses third-order tensor modeling to track multiple objects from prompts.
- Formulate new evaluation protocols and class-agnostic metrics to benchmark prompt-based tracking.
Proposed method
- Formulate a third-order-tensor-based auto-regressive framework to model image tokens, tracklets, and prompt tokens: $T_t = \mathbb{1}_{D \times D \times D} \times_1 \mathrm{enc}(I_t) \times_2 \mathrm{ext}(T_{t-1}) \times_3 \mathrm{emb}(P)$.
- Introduce MENDER, a single-stage attention-based tracker that simplifies correlations to reduce complexity from O(n^3) to O(n^2) by equating region-prompt with tracklet-prompt relationships.
- Use cross-attention to model region-tracklet-prompt correlations and an object decoder to predict bounding boxes and confidences (Eq. 11).
- Train with alignment loss $\mathcal{L}_{T|P}$, objectness loss $\mathcal{L}_{I|T}$, and $\mathcal{L}_{\mathrm{GIoU}}$ for box regression, using Hungarian assignment to match predictions to ground truth.
- Leverage RoBERTa for text embeddings and a ResNet-101 backbone with Deformable DETR-style encoding to produce visual tokens (D=512).
- Evaluate across five GroOT settings (three standard plus two prompt-based) and compare with a two-stage baseline (MDETR + TFm) and state-of-the-art MOT methods.
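The tensor formulation in the list above chains three mode-n products onto a D×D×D core. A minimal NumPy sketch of that operation follows; it is an illustration, not the authors' implementation. The function names `mode_products` and `superdiagonal`, the token counts M, N, K, and the choice of a super-diagonal core for the demo are all assumptions, since this review does not spell out the core tensor's entries, so the core is passed in as an argument.

```python
import numpy as np


def mode_products(core, enc_img, ext_trk, emb_prm):
    """Compute core x_1 enc_img x_2 ext_trk x_3 emb_prm.

    core:    (D, D, D) core tensor (its entries are an assumption here)
    enc_img: (M, D) image/region tokens, standing in for enc(I_t)
    ext_trk: (N, D) tracklet tokens, standing in for ext(T_{t-1})
    emb_prm: (K, D) prompt tokens, standing in for emb(P)
    returns: (M, N, K) correlation tensor
    """
    # Mode-n product: contract mode n of the core with the second
    # axis of the corresponding token matrix.
    return np.einsum("ijl,mi,nj,kl->mnk", core, enc_img, ext_trk, emb_prm)


def superdiagonal(d):
    """Hypothetical core: ones on the (i, i, i) diagonal, zeros elsewhere."""
    core = np.zeros((d, d, d))
    idx = np.arange(d)
    core[idx, idx, idx] = 1.0
    return core


rng = np.random.default_rng(0)
D, M, N, K = 4, 5, 3, 2  # toy sizes, chosen only for illustration
enc_img = rng.normal(size=(M, D))
ext_trk = rng.normal(size=(N, D))
emb_prm = rng.normal(size=(K, D))

T = mode_products(superdiagonal(D), enc_img, ext_trk, emb_prm)
print(T.shape)
```

With the super-diagonal core, each entry `T[m, n, k]` reduces to the trilinear sum over the shared feature dimension of one region token, one tracklet token, and one prompt token. Materializing the full M×N×K tensor is the cubic cost that the simplified correlation in MENDER is described as avoiding.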
Experimental results
Research questions
- RQ1: Can natural language prompts effectively specify and retrieve multiple objects over time in a tracking setting?
- RQ2: Does a single-stage, class-agnostic tracker with prompt-based inputs outperform traditional two-stage pipelines on grounded MOT tasks?
- RQ3: How do different prompt formulations (name, synonyms, definitions, captions) impact tracking accuracy and efficiency?
- RQ4: What are robust, class-agnostic metrics and evaluation protocols for Type-to-Track scenarios?
- RQ5: Is the proposed MENDER approach scalable to long video sequences with many objects under various prompts?
Key findings
- MENDER outperforms a two-stage baseline design in accuracy and efficiency, up to 14.7% accuracy improvement and 4× speedup.
- Across five GroOT settings, MENDER achieves state-of-the-art class-agnostic metrics (CA-MOTA, CA-IDF1, CA-HOTA) and competitive mAP50.
- The simplified correlation representation yields speed gains of about 2× (e.g., 7.8 FPS vs. 3.4 FPS on the MOT17 cap setting) with slight accuracy improvements.
- GroOT is a 2× larger and more diverse MOT dataset with 833 object classes and 256K word captions, enabling richer evaluation of grounded MOT with prompts.
- MENDER maintains identity tracking with a single-stage design, reducing the need for separate detection and tracking feature extraction.
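The class-agnostic metrics named in the findings are not defined in this review. As a rough illustration, the sketch below computes the standard CLEAR MOT accuracy, MOTA = 1 - (FN + FP + IDSW) / GT, from per-frame counts. Treating CA-MOTA as this same formula with all object categories pooled into one class is an assumption made here, and the function name `mota` and the input layout are hypothetical.

```python
def mota(per_frame):
    """Standard MOTA from per-frame error counts.

    per_frame: list of (fn, fp, idsw, gt) tuples, one per frame, where
    fn = missed objects, fp = false positives, idsw = identity switches,
    gt = number of ground-truth objects in that frame.
    Assumption: for a class-agnostic variant, the counts are taken over
    all categories pooled together rather than per class.
    """
    errors = sum(fn + fp + idsw for fn, fp, idsw, _ in per_frame)
    total_gt = sum(gt for _, _, _, gt in per_frame)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    return 1.0 - errors / total_gt


# Toy counts for three frames with 10 ground-truth objects each.
frames = [(1, 0, 0, 10), (0, 2, 1, 10), (0, 0, 0, 10)]
score = mota(frames)
print(round(score, 4))
```

Here 4 errors over 30 ground-truth objects give a score of 1 - 4/30. MOTA can go negative when errors outnumber ground-truth objects, which is why it is usually reported alongside identity-focused metrics such as IDF1 and HOTA.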
This review was created by AI and reviewed by human editors.