QUICK REVIEW

[Paper Review] Deeper and Wider Siamese Networks for Real-Time Visual Tracking

Zhipeng Zhang, Houwen Peng|arXiv (Cornell University)|Jan 7, 2019

Video Surveillance and Tracking Methods | References: 42 | Cited by: 110

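One-Sentence Summary

This paper introduces cropping-inside residual (CIR) units to eliminate the positional bias caused by padding, and uses them to build deeper and wider Siamese backbones (the CIResNet family, CIResInception, CIResNeXt) for SiamFC and SiamRPN, delivering significant accuracy gains at real-time speed.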

ABSTRACT

Siamese networks have drawn great attention in visual tracking because of their balanced accuracy and speed. However, the backbone networks used in Siamese trackers are relatively shallow, such as AlexNet [18], which does not fully take advantage of the capability of modern deep neural networks. In this paper, we investigate how to leverage deeper and wider convolutional neural networks to enhance tracking robustness and accuracy. We observe that direct replacement of backbones with existing powerful architectures, such as ResNet [14] and Inception [33], does not bring improvements. The main reasons are that 1) large increases in the receptive field of neurons lead to reduced feature discriminability and localization precision; and 2) the network padding for convolutions induces a positional bias in learning. To address these issues, we propose new residual modules to eliminate the negative impact of padding, and further design new architectures using these modules with controlled receptive field size and network stride. The designed architectures are lightweight and guarantee real-time tracking speed when applied to SiamFC [2] and SiamRPN [20]. Experiments show that solely due to the proposed network architectures, our SiamFC+ and SiamRPN+ obtain up to 9.8%/5.7% (AUC), 23.3%/8.8% (EAO) and 24.4%/25.0% (EAO) relative improvements over the original versions [2, 20] on the OTB-15, VOT-16 and VOT-17 datasets, respectively.

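Motivation and Objectives

Analyze how backbone depth and width affect the performance of Siamese trackers.
Identify the factors that cause performance to degrade when deeper networks are used.
Propose residual modules that eliminate the positional bias induced by padding.
Design deeper and wider CIR-based backbones with controlled receptive field and stride.
Demonstrate improved accuracy at real-time tracking speed on standard benchmarks.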

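Proposed Method

Introduce cropping-inside residual (CIR) units, which crop out padding-affected features after the residual addition.
Replace the padded backbones in SiamFC and SiamRPN with CIR-based backbones (CIResNet, CIResInception, CIResNeXt).
Control the receptive field (RF) size and network stride so that the RF stays at 60%-80% of the exemplar size.
Build deeper and wider networks from CIR units to balance localization precision against feature richness.
Initialize the networks with ImageNet pre-training, then fine-tune them within the SiamFC/SiamRPN frameworks, unfreezing layers in stages.
Evaluate on standard tracking benchmarks (OTB, VOT) and compare against the AlexNet baseline and state-of-the-art trackers.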
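
The cropping step in a CIR unit can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes a single-channel feature map, a single zero-padded 3x3 convolution as the residual branch, and an identity shortcut, and the helper names (`conv3x3`, `crop`, `add`, `cir_unit`) are hypothetical. It shows that cropping one position from each border after the addition leaves only values that never touched the zero padding.

```python
def conv3x3(x, kernel, pad):
    """Single-channel 3x3 cross-correlation on a list-of-lists feature map.

    pad=1 adds a zero border (output keeps the input size);
    pad=0 is a 'valid' convolution (output shrinks by 2 per dimension).
    """
    if pad:
        width = len(x[0]) + 2 * pad
        border = [[0.0] * width for _ in range(pad)]
        x = (border
             + [[0.0] * pad + list(row) + [0.0] * pad for row in x]
             + [[0.0] * width for _ in range(pad)])
    h, w = len(x), len(x[0])
    return [
        [sum(kernel[di][dj] * x[i + di][j + dj]
             for di in range(3) for dj in range(3))
         for j in range(w - 2)]
        for i in range(h - 2)
    ]


def crop(x, n=1):
    """Drop the n outermost rows and columns of a feature map."""
    return [row[n:-n] for row in x[n:-n]]


def add(a, b):
    """Element-wise sum of two equally sized feature maps."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def cir_unit(x, kernel):
    """Simplified CIR unit (assumption: one padded 3x3 conv as the residual branch)."""
    residual = conv3x3(x, kernel, pad=1)  # same spatial size as x
    summed = add(residual, x)             # residual addition with identity shortcut
    return crop(summed, n=1)              # remove positions influenced by zero padding


if __name__ == "__main__":
    x = [[r * 5 + c for c in range(5)] for r in range(5)]
    k = [[1, 0, -1], [2, 1, -2], [1, 0, -1]]
    out = cir_unit(x, k)
    # The cropped result matches a padding-free computation on the same input.
    print(out == add(conv3x3(x, k, pad=0), crop(x)))
```

In this simplified setting the cropped output equals what a padding-free ("valid") convolution plus a center-cropped shortcut would produce, which is the sense in which the unit removes padding-affected features.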

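Experimental Results

Research Questions

RQ1: How do depth, width, receptive field, stride, and padding affect the accuracy and localization of Siamese tracking?
RQ2: Does padding-induced positional bias degrade Siamese tracking performance, and how can it be mitigated?
RQ3: Can deeper and/or wider CIR-based backbones improve the accuracy of SiamFC and SiamRPN while preserving real-time speed?
RQ4: Which architectural guidelines maximize the robustness and discriminability of the Siamese feature embedding?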

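Key Findings

With deeper backbones, Siamese trackers benefit from smaller strides (4 or 8) rather than larger ones.
The optimal receptive field of neurons in the final layer is about 60%-80% of the exemplar size, and the maximum receptive field should not exceed the exemplar.
Padding in fully convolutional Siamese networks introduces a positional bias that degrades localization near image boundaries.
CIR units (and their generalized versions, CIR-Inception and CIR-NeXt) remove padding-affected features and improve discriminability, yielding significant gains over the AlexNet baseline.
CIResNet-22 delivers significant gains: relative to the original trackers, the SiamFC/SiamRPN variants improve AUC by up to 9.8% on OTB-15 and EAO by up to 23.3% on VOT-16 and up to 25.0% on VOT-17, while running in real time (roughly 70-150 FPS depending on the setup).
SiamFC+ and SiamRPN+ (using CIResNet-22) outperform previous Siamese trackers on OTB-2015 and VOT-17, with SiamRPN+ reaching about 150 FPS on a GTX 1080.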
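
The stride and receptive-field guidelines above can be checked with the standard receptive-field recurrence. The sketch below is illustrative only: the layer stack is hypothetical (it is not the published CIResNet-22 configuration), the 127-pixel exemplar size is an assumption taken from the common SiamFC setup rather than from this review, and the function names are made up for the example.

```python
def receptive_field(layers):
    """Return (receptive_field, total_stride) in input pixels.

    layers: sequence of (kernel_size, stride) pairs, in forward order.
    Uses the usual recurrence: rf += (k - 1) * jump; jump *= stride.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf, jump


def within_guideline(rf, exemplar, low=0.6, high=0.8):
    """True if rf is between 60% and 80% of the exemplar size."""
    return low * exemplar <= rf <= high * exemplar


# Hypothetical backbone: 7x7/2 stem, 3x3/2 pooling, three 3x3 convs,
# one 3x3/2 downsampling layer, then three more 3x3 convs.
# (1x1 convolutions are omitted because they do not change the RF.)
HYPOTHETICAL_STACK = [(7, 2), (3, 2), (3, 1), (3, 1), (3, 1),
                      (3, 2), (3, 1), (3, 1), (3, 1)]
EXEMPLAR_SIZE = 127  # assumption: common SiamFC exemplar size

if __name__ == "__main__":
    rf, stride = receptive_field(HYPOTHETICAL_STACK)
    print(rf, stride, within_guideline(rf, EXEMPLAR_SIZE))
```

For this made-up stack the recurrence gives a receptive field of 91 pixels at a total stride of 8, about 72% of a 127-pixel exemplar, so it would satisfy both guidelines; stopping after the first five layers gives a receptive field of only 35 pixels, well below the 60% mark.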

This review was generated by AI and reviewed by human editors.