QUICK REVIEW

[Paper Review] Siamese Cascaded Region Proposal Networks for Real-Time Visual Tracking

Heng Fan, Haibin Ling|arXiv (Cornell University)|Dec 14, 2018

Video Surveillance and Tracking Methods | References: 45 | Cited by: 28

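ONE-SENTENCE SUMMARY

This paper proposes the Siamese Cascaded Region Proposal Network (C-RPN), a multi-stage visual tracking framework that cascades multiple RPNs across feature levels in a Siamese network. Through stage-wise hard negative sampling, multi-level feature fusion via a feature transfer block (FTB), and progressive regression with adjusted anchor boxes, C-RPN achieves state-of-the-art performance on six benchmarks while running in real time at about 32 fps.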

ABSTRACT

Region proposal networks (RPN) have been recently combined with the Siamese network for tracking, and shown excellent accuracy with high efficiency. Nevertheless, previously proposed one-stage Siamese-RPN trackers degenerate in presence of similar distractors and large scale variation. Addressing these issues, we propose a multi-stage tracking framework, Siamese Cascaded RPN (C-RPN), which consists of a sequence of RPNs cascaded from deep high-level to shallow low-level layers in a Siamese network. Compared to previous solutions, C-RPN has several advantages: (1) Each RPN is trained using the outputs of RPN in the previous stage. Such process stimulates hard negative sampling, resulting in more balanced training samples. Consequently, the RPNs are sequentially more discriminative in distinguishing difficult background (i.e., similar distractors). (2) Multi-level features are fully leveraged through a novel feature transfer block (FTB) for each RPN, further improving the discriminability of C-RPN using both high-level semantic and low-level spatial information. (3) With multiple steps of regressions, C-RPN progressively refines the location and shape of the target in each RPN with adjusted anchor boxes in the previous stage, which makes localization more accurate. C-RPN is trained end-to-end with the multi-task loss function. In inference, C-RPN is deployed as it is, without any temporal adaption, for real-time tracking. In extensive experiments on OTB-2013, OTB-2015, VOT-2016, VOT-2017, LaSOT and TrackingNet, C-RPN consistently achieves state-of-the-art results and runs in real-time.

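MOTIVATION AND OBJECTIVES

Address the limitations of one-stage Siamese-RPN trackers when handling similar distractors and large scale variation.
Use cascaded RPNs to perform stage-wise hard negative sampling, easing the class imbalance in training.
Refine the bounding box progressively through multi-step regression with adjusted anchor boxes, improving localization accuracy.
Fuse high-level semantic features with low-level spatial features through a novel feature transfer block (FTB), strengthening the feature representation.
Train the whole cascade end-to-end so that inference runs in real time without online adaptation.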

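PROPOSED METHOD

Cascade multiple RPNs from deep (high-level) to shallow (low-level) layers of a Siamese network to build a multi-stage tracking pipeline.
Train each RPN on the samples that survive filtering by the previous stage's outputs, so that easy negatives are discarded and the classifiers become progressively more discriminative.
Introduce a feature transfer block (FTB) that fuses multi-level features, combining semantic and spatial information to improve feature discriminability.
Apply multi-step regression: each RPN refines the target proposal starting from the anchor boxes adjusted by the previous stage.
Use an end-to-end multi-task loss that jointly optimizes the classification and regression losses of all stages.
Deploy the trained C-RPN model as is at inference time, with no online adaptation, to keep tracking real-time.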
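The stage-to-stage flow listed above (drop anchors that the previous RPN confidently scores as background, regress the survivors, and hand them to the next stage as its anchors) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the (cx, cy, w, h) box layout, the standard RPN offset parameterization, and the `neg_threshold` value of 0.95 are assumptions made for the example, and the scores and offsets are made-up toy numbers rather than network outputs.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, w, h), an assumed layout


def refine_anchor(anchor: Box, delta: Box) -> Box:
    """Apply regression offsets (dx, dy, dw, dh) to one anchor box."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = delta
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))


def cascade_stage(
    anchors: List[Box],
    neg_scores: List[float],
    deltas: List[Box],
    neg_threshold: float = 0.95,  # hypothetical value, chosen for illustration
) -> List[Box]:
    """Filter out easy negatives, then refine the remaining anchors."""
    next_anchors = []
    for anchor, neg, delta in zip(anchors, neg_scores, deltas):
        if neg > neg_threshold:
            continue  # confidently background: not passed to the next stage
        next_anchors.append(refine_anchor(anchor, delta))
    return next_anchors


# Toy data: three anchors, the second one is an easy negative.
anchors = [(50.0, 50.0, 20.0, 40.0), (10.0, 10.0, 20.0, 20.0), (52.0, 48.0, 30.0, 30.0)]
stage1_neg = [0.10, 0.99, 0.60]
stage1_deltas = [
    (0.25, 0.0, 0.0, 0.0),
    (0.0, 0.0, 0.0, 0.0),
    (0.0, -0.5, math.log(2.0), 0.0),
]

stage2_anchors = cascade_stage(anchors, stage1_neg, stage1_deltas)
print(stage2_anchors)
```

Each later stage would call `cascade_stage` again on the returned anchors with its own scores and offsets, so both the sample set and the box estimates are narrowed step by step.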

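EXPERIMENTAL RESULTS

Research Questions

RQ1: Does the cascaded RPN architecture improve robustness to similar distractors compared with a one-stage Siamese-RPN?
RQ2: Does stage-wise hard negative sampling improve class balance and strengthen discrimination of difficult background samples?
RQ3: Does multi-level feature fusion through the feature transfer block (FTB) improve tracking accuracy by combining semantic and spatial information?
RQ4: Does progressive regression with adjusted anchor boxes improve localization accuracy under large scale variation?
RQ5: Can the cascaded design keep real-time inference speed while maintaining state-of-the-art performance?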

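Key Findings

C-RPN achieves state-of-the-art performance on all six benchmarks (OTB-2013, OTB-2015, VOT-2016, VOT-2017, LaSOT, and TrackingNet), with consistent gains over prior methods.
On LaSOT, C-RPN achieves a success score (SUC) of 0.455 under Protocol II; on VOT-2017, it leads the runner-up by 1.6% (SUC) and 0.7% (EAO), respectively.
On TrackingNet, C-RPN achieves a precision (PREC) of 0.619, a normalized precision (NPREC) of 0.746, and a success score (SUC) of 0.669, ahead of the runner-up MDNet by 5.4%, 4.1%, and 6.3%, respectively.
Ablation studies show that every component contributes: adding stages raises SUC by 2.9 percentage points (from 0.417 to 0.446) and EAO by 3.5 percentage points (from 0.248 to 0.283).
Removing negative anchor filtering lowers SUC by 1.6% and EAO by 0.7%, confirming that hard negative sampling is essential.
The feature transfer block (FTB) raises SUC by 1.3% and EAO by 1.1%, demonstrating the benefit of multi-level feature fusion.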
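The ablation gains quoted above are differences between raw scores, so they read most naturally as percentage points rather than relative improvements. A small check makes the distinction concrete. It uses only the before/after values stated in the findings; the function name and dictionary layout are illustrative assumptions.

```python
def gains(before: float, after: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = (after - before) * 100
    relative = (after - before) / before * 100
    return round(absolute, 1), round(relative, 1)


# Scores before and after adding cascade stages, as quoted in the findings.
ablation = {"SUC": (0.417, 0.446), "EAO": (0.248, 0.283)}

results = {name: gains(before, after) for name, (before, after) in ablation.items()}
for name, (abs_points, rel_percent) in results.items():
    print(f"{name}: +{abs_points} points absolute, +{rel_percent}% relative")
```

The absolute figures match the 2.9 and 3.5 quoted in the findings, while the relative improvements over the single-stage baseline are noticeably larger.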


This review was generated by AI and checked by a human editor.