QUICK REVIEW

[Paper Review] FCN-rLSTM: Deep Spatio-Temporal Neural Networks for Vehicle Counting in City Cameras

Shanghang Zhang, Guanhang Wu|arXiv (Cornell University)|Jul 29, 2017

Video Surveillance and Tracking Methods|References: 25|Cited by: 23

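One-Sentence Summary

The paper proposes FCN-rLSTM, a deep spatio-temporal neural network that connects a fully convolutional network (FCN) with a long short-term memory (LSTM) network in a residual learning fashion to count vehicles in low-quality city camera video. By learning the count as a residual with reference to the sum of densities in each frame, it reduces mean absolute error (MAE) on benchmark datasets by up to 44.2% (from 2.74 to 1.53 on WebCamT), accelerates training by 5 times on average, and remains robust under low resolution, low frame rate, and high occlusion.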

ABSTRACT

In this paper, we develop deep spatio-temporal neural networks to sequentially count vehicles from low quality videos captured by city cameras (citycams). Citycam videos have low resolution, low frame rate, high occlusion and large perspective, making most existing methods lose their efficacy. To overcome limitations of existing methods and incorporate the temporal information of traffic video, we design a novel FCN-rLSTM network to jointly estimate vehicle density and vehicle count by connecting fully convolutional neural networks (FCN) with long short term memory networks (LSTM) in a residual learning fashion. Such design leverages the strengths of FCN for pixel-level prediction and the strengths of LSTM for learning complex temporal dynamics. The residual learning connection reformulates the vehicle count regression as learning residual functions with reference to the sum of densities in each frame, which significantly accelerates the training of networks. To preserve feature map resolution, we propose a Hyper-Atrous combination to integrate atrous convolution in FCN and combine feature maps of different convolution layers. FCN-rLSTM enables refined feature representation and a novel end-to-end trainable mapping from pixels to vehicle count. We extensively evaluated the proposed method on different counting tasks with three datasets, with experimental results demonstrating their effectiveness and robustness. In particular, FCN-rLSTM reduces the mean absolute error (MAE) from 5.31 to 4.21 on TRANCOS, and reduces the MAE from 2.74 to 1.53 on WebCamT. Training process is accelerated by 5 times on average.

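Research Motivation and Objectives

Address the difficulty existing methods have in counting vehicles accurately in city camera video with low resolution, low frame rate, and high occlusion.
Exploit temporal correlation across sequential video frames to improve counting accuracy when motion and resolution are limited.
Develop an end-to-end trainable spatio-temporal deep learning framework that jointly estimates vehicle density and global count.
Accelerate training by reformulating count regression as learning a residual function with reference to the sum of densities in each frame.
Achieve robust performance across diverse traffic scenes and datasets under varying video quality and temporal consistency.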

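Proposed Method

FCN-rLSTM uses a fully convolutional network (FCN) for pixel-level vehicle density prediction and combines it with stacked long short-term memory (LSTM) networks to model temporal dynamics.
A residual learning connection reformulates global vehicle count regression as learning a residual function with reference to the sum of densities in each frame, which improves training stability and speed.
The Hyper-Atrous combination integrates atrous convolution into the FCN and fuses feature maps from multiple convolution layers, preserving spatial resolution and strengthening the feature representation.
The network processes video frames sequentially: the FCN output (a density map) is fed to the LSTM, which predicts a residual count; this residual is added to the sum of the density map to produce the final vehicle count.
The whole architecture is end-to-end trainable, so it can be optimized directly from raw pixels to global vehicle count.
The method adapts to datasets with or without temporal correlation by choosing between the FCN-rLSTM configuration (for temporal data) and the FCN-HA configuration (for non-temporal data).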
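
The residual connection described above can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `residual_count` is hypothetical, `density_maps` stands in for the FCN output, and `residuals` stands in for the LSTM output with made-up values.

```python
import numpy as np

def residual_count(density_maps, residuals):
    """Combine per-frame density sums with predicted residuals.

    density_maps: array of shape (T, H, W), one density map per frame
                  (stand-in for the FCN output; hypothetical values).
    residuals:    array of shape (T,), one residual per frame
                  (stand-in for the LSTM output; hypothetical values).
    """
    density_maps = np.asarray(density_maps, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    # Base count: sum each frame's density map over its pixels.
    base_counts = density_maps.sum(axis=(1, 2))
    # Final count: base count plus the residual correction.
    return base_counts + residuals

# Toy example: two frames of 2x2 density maps.
maps = np.array([[[0.5, 0.5], [1.0, 0.0]],
                 [[0.25, 0.25], [0.25, 0.25]]])
res = np.array([0.5, -0.25])
counts = residual_count(maps, res)
print(counts)
```

Because the density sum already gives a rough count, the recurrent part only has to learn a correction term rather than the full count, which is the intuition the summary gives for the faster training.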

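Experimental Results

Research Questions

RQ1: Can a deep spatio-temporal network architecture effectively model vehicle counting dynamics in low-quality city camera video with low frame rate and high occlusion?
RQ2: Does introducing residual learning between the FCN and the LSTM improve training speed and model convergence in vehicle counting?
RQ3: Does integrating atrous convolution with multi-scale feature fusion strengthen the feature representation for low-resolution video input?
RQ4: How does the proposed method compare with state-of-the-art methods in accuracy and robustness across different datasets?
RQ5: To what extent can temporal correlation across sequential frames improve counting performance when motion and resolution are limited?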

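Key Findings

On the TRANCOS dataset, FCN-rLSTM reduces the mean absolute error (MAE) from 5.31 to 4.21, a 20.7% improvement over the best baseline.
On the WebCamT dataset, FCN-rLSTM reduces the MAE from 2.74 to 1.53, a 44.2% relative improvement.
Owing to the residual learning formulation, training is on average 5 times faster than with the non-residual baseline.
On the UCSD pedestrian counting dataset, FCN-rLSTM achieves an MAE of 1.54 and an MSE of 3.02, outperforming all baseline methods and the FCN-HA configuration.
The model generalizes well, performing strongly on both vehicle and pedestrian counting despite differences in object scale and scene complexity.
Ablation studies confirm that temporal modeling with the LSTM significantly improves performance on datasets with temporal consistency, supporting the importance of temporal correlation in low-quality video.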
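
The relative improvements quoted for TRANCOS and WebCamT follow directly from the reported MAE values. The short check below recomputes them; the helper name `relative_reduction` is an illustrative assumption, and the inputs are the MAE figures given in the abstract.

```python
def relative_reduction(baseline_mae, new_mae):
    """Percentage reduction in MAE relative to the baseline."""
    return (baseline_mae - new_mae) / baseline_mae * 100.0

# MAE values as reported in the abstract.
trancos = relative_reduction(5.31, 4.21)
webcamt = relative_reduction(2.74, 1.53)

print(f"TRANCOS: {trancos:.1f}%")  # TRANCOS: 20.7%
print(f"WebCamT: {webcamt:.1f}%")  # WebCamT: 44.2%
```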

This review was generated by AI and checked by a human editor.