QUICK REVIEW

[Paper Review] Siamese Cascaded Region Proposal Networks for Real-Time Visual Tracking

Heng Fan, Haibin Ling | arXiv (Cornell University) | Dec 14, 2018
Video Surveillance and Tracking Methods | 45 references | 28 citations
TL;DR

This paper proposes Siamese Cascaded Region Proposal Networks (C-RPN), a multi-stage visual tracking framework that improves accuracy and robustness by cascading multiple RPNs across feature levels in a Siamese network. By performing stage-wise hard negative sampling, leveraging multi-level features via a Feature Transfer Block (FTB), and using progressive regression with adaptive anchors, C-RPN achieves state-of-the-art performance on six benchmarks while running in real-time at ~32 fps.

ABSTRACT

Region proposal networks (RPN) have been recently combined with the Siamese network for tracking, and shown excellent accuracy with high efficiency. Nevertheless, previously proposed one-stage Siamese-RPN trackers degenerate in presence of similar distractors and large scale variation. Addressing these issues, we propose a multi-stage tracking framework, Siamese Cascaded RPN (C-RPN), which consists of a sequence of RPNs cascaded from deep high-level to shallow low-level layers in a Siamese network. Compared to previous solutions, C-RPN has several advantages: (1) Each RPN is trained using the outputs of RPN in the previous stage. Such process stimulates hard negative sampling, resulting in more balanced training samples. Consequently, the RPNs are sequentially more discriminative in distinguishing difficult background (i.e., similar distractors). (2) Multi-level features are fully leveraged through a novel feature transfer block (FTB) for each RPN, further improving the discriminability of C-RPN using both high-level semantic and low-level spatial information. (3) With multiple steps of regressions, C-RPN progressively refines the location and shape of the target in each RPN with adjusted anchor boxes in the previous stage, which makes localization more accurate. C-RPN is trained end-to-end with the multi-task loss function. In inference, C-RPN is deployed as it is, without any temporal adaption, for real-time tracking. In extensive experiments on OTB-2013, OTB-2015, VOT-2016, VOT-2017, LaSOT and TrackingNet, C-RPN consistently achieves state-of-the-art results and runs in real-time.

Motivation & Objective

  • Address the limitations of one-stage Siamese-RPN trackers in handling similar distractors and large scale variations.
  • Reduce class imbalance in training by introducing stage-wise hard negative sampling through cascaded RPNs.
  • Improve localization accuracy by progressively refining bounding boxes using multiple regression steps with adjusted anchors.
  • Enhance feature representation by fusing high-level semantic and low-level spatial features via a novel Feature Transfer Block (FTB).
  • Achieve real-time inference without temporal adaptation by training the entire cascade end-to-end.

Proposed method

  • Cascades multiple RPNs from deep (high-level) to shallow (low-level) layers in a Siamese network to form a multi-stage tracking pipeline.
  • Trains each RPN on the outputs of the previous stage: easy negative anchors are filtered out, so later stages see harder negatives and more balanced training samples, and become progressively more discriminative.
  • Introduces a Feature Transfer Block (FTB) that fuses features across multiple layers to enhance discriminability using both semantic and spatial information.
  • Employs multi-step regression: each RPN refines the target proposal using anchor boxes adjusted from the previous stage’s output.
  • Uses an end-to-end multi-task loss function combining classification and regression losses across all stages.
  • Deploys the trained C-RPN model directly in inference without online adaptation, ensuring real-time performance.
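
The cascade described in the list above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Head` interface, the `toy_head` stand-in for a trained RPN, the `neg_threshold` value of 0.95, and the box parameterization (the common RPN-style center/size deltas) are all assumptions chosen for illustration. The sketch only shows the control flow: each stage drops anchors it is confident are background, then regresses the survivors, and the refined boxes become the anchors of the next stage.

```python
import math
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (center_x, center_y, width, height)
# Hypothetical interface: a head maps boxes to (negative confidences, deltas).
Head = Callable[[List[Box]], Tuple[List[float], List[Box]]]


def apply_deltas(box: Box, deltas: Box) -> Box:
    """Decode RPN-style deltas: shift the center, rescale width and height."""
    cx, cy, w, h = box
    dx, dy, dw, dh = deltas
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))


def run_cascade(anchors: List[Box], heads: List[Head],
                neg_threshold: float = 0.95) -> List[Box]:
    """Run the stages in order; each stage filters, then refines, the boxes."""
    boxes = list(anchors)
    for head in heads:
        neg_scores, deltas = head(boxes)
        boxes = [
            apply_deltas(box, delta)
            for box, score, delta in zip(boxes, neg_scores, deltas)
            if score <= neg_threshold  # drop easy negatives (assumed rule)
        ]
    return boxes


def toy_head(target: Box) -> Head:
    """Hypothetical stand-in for a trained RPN head (uses a known target)."""
    def head(boxes: List[Box]) -> Tuple[List[float], List[Box]]:
        scores, deltas = [], []
        for cx, cy, w, h in boxes:
            dist = math.hypot(target[0] - cx, target[1] - cy)
            scores.append(min(1.0, dist / 50.0))  # far away = likely background
            deltas.append((                       # move halfway to the target
                0.5 * (target[0] - cx) / w,
                0.5 * (target[1] - cy) / h,
                0.5 * math.log(target[2] / w),
                0.5 * math.log(target[3] / h),
            ))
        return scores, deltas
    return head


target = (50.0, 50.0, 30.0, 30.0)
anchors = [(60.0, 60.0, 20.0, 20.0),    # near the target
           (150.0, 150.0, 20.0, 20.0),  # far away: an easy negative
           (40.0, 50.0, 40.0, 40.0)]    # near, but too large
refined = run_cascade(anchors, [toy_head(target)] * 3)
```

In this toy setup the distant anchor is discarded at the first stage, and the two remaining boxes move closer to the target with every stage, which is the intuition behind progressive regression with adjusted anchors. A real tracker would compute scores and deltas from Siamese features rather than from a known target.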

Experimental results

Research questions

  • RQ1: Can a cascaded RPN architecture improve robustness to similar distractors compared to one-stage Siamese-RPN?
  • RQ2: Does stage-wise hard negative sampling lead to better class balance and improved discrimination against difficult background samples?
  • RQ3: Can multi-level feature fusion via a Feature Transfer Block (FTB) enhance tracking accuracy by combining semantic and spatial information?
  • RQ4: Does progressive regression with adaptive anchors improve localization accuracy under large scale variations?
  • RQ5: Can the cascaded design maintain real-time inference speed while achieving state-of-the-art performance?

Key findings

  • C-RPN achieves state-of-the-art performance on OTB-2013, OTB-2015, VOT-2016, VOT-2017, LaSOT, and TrackingNet, with consistent improvements over prior methods.
  • On LaSOT, C-RPN achieves a success score (SUC) of 0.455 under Protocol II. The margins quoted over the second-best tracker are 1.6% in SUC and, on VOT-2017, 0.7% in EAO.
  • On TrackingNet, C-RPN achieves a precision score of 0.619, normalized precision of 0.746, and success score of 0.669, surpassing MDNet (second best) by 5.4%, 4.1%, and 6.3% respectively.
  • Ablation studies indicate that each component contributes: adding cascade stages raises SUC by 2.9 percentage points (from 0.417 to 0.446) and EAO by 3.5 percentage points (from 0.248 to 0.283).
  • Removing negative anchor filtering reduces performance by 1.6% in SUC and 0.7% in EAO, confirming the importance of hard negative sampling.
  • The Feature Transfer Block (FTB) improves SUC by 1.3% and EAO by 1.1%, demonstrating the effectiveness of multi-level feature fusion.
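
The stage ablation figures above are differences between scores on a 0-to-1 scale (for example, 0.446 minus 0.417 is 0.029), so they are absolute gains in percentage points, not relative improvements. The short sketch below makes the distinction explicit; the helper name `gains` is hypothetical and the inputs are the scores quoted in the findings.

```python
def gains(before: float, after: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = absolute / before
    return round(absolute * 100, 1), round(relative * 100, 1)


suc_gain = gains(0.417, 0.446)  # SUC before and after adding stages
eao_gain = gains(0.248, 0.283)  # EAO before and after adding stages
print("SUC:", suc_gain)
print("EAO:", eao_gain)
```

Read this way, the reported 2.9 and 3.5 are the first element of each pair; the relative improvements are larger because the baseline scores are well below 1.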

This review was created by AI and reviewed by human editors.