QUICK REVIEW

[Paper Review] Fused Deep Neural Networks for Efficient Pedestrian Detection

Xianzhi Du, Mostafa El‐Khamy|arXiv (Cornell University)|May 2, 2018

Video Surveillance and Tracking Methods|References: 1|Cited by: 27

ONE-SENTENCE SUMMARY

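This paper proposes a fused deep neural network (F-DNN) that achieves efficient and accurate pedestrian detection by combining a single-shot detector for candidate generation, an ensemble of deep verification networks, and a semantic segmentation network. Using a novel soft-label training method and a soft-rejection fusion strategy, the system reaches a log-average miss rate of 7.67% on the Caltech dataset, the state of the art at the time of publication, while maintaining a high running speed.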

ABSTRACT

In this paper, we present an efficient pedestrian detection system, designed by fusion of multiple deep neural network (DNN) systems. Pedestrian candidates are first generated by a single shot convolutional multi-box detector at different locations with various scales and aspect ratios. The candidate generator is designed to provide the majority of ground truth pedestrian annotations at the cost of a large number of false positives. Then, a classification system using the idea of ensemble learning is deployed to improve the detection accuracy. The classification system further classifies the generated candidates based on opinions of multiple deep verification networks and a fusion network which utilizes a novel soft-rejection fusion method to adjust the confidence in the detection results. To improve the training of the deep verification networks, a novel soft-label method is devised to assign floating point labels to the generated pedestrian candidates. A deep context aggregation semantic segmentation network also provides pixel-level classification of the scene and its results are softly fused with the detection results by the single shot detector. Our pedestrian detector compared favorably to state-of-art methods on all popular pedestrian detection datasets. For example, our fused DNN has better detection accuracy on the Caltech Pedestrian dataset than all previous state of art methods, while also being the fastest. We significantly improved the log-average miss rate on the Caltech pedestrian dataset to 7.67% and achieved the new state-of-the-art.

RESEARCH MOTIVATION AND GOALS

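Address the challenge of pedestrian detection that is both highly accurate and real-time in complex scenes such as occlusion and dense crowds.
Reduce false positives in pedestrian detection by improving candidate verification through ensemble learning and confidence fusion.
Improve robustness in complex scenes by softly fusing pixel-level semantic segmentation with bounding-box detection.
Use a novel soft-label method that encodes IoU overlap as floating-point labels to improve the training efficiency and generalization of the verification networks.
Design a lightweight, fast-inference pipeline that fuses the outputs of multiple networks through learnable confidence weighting while maintaining high accuracy.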
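
The soft-label goal above relies on the IoU between a candidate box and the ground-truth boxes. A minimal sketch of that computation follows. The function names `iou` and `soft_label`, the `(x1, y1, x2, y2)` box format, and the choice of taking the best overlap over all ground-truth boxes are assumptions made for illustration; the summary only states that the label is the IoU between the predicted box and the ground-truth box.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_label(candidate, ground_truths):
    """Hypothetical soft label: best IoU of the candidate against any ground truth."""
    return max((iou(candidate, gt) for gt in ground_truths), default=0.0)
```

A candidate that half-overlaps a pedestrian then receives a fractional label instead of being forced to 0 or 1, which is the continuous supervision the soft-label method is meant to provide.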

PROPOSED METHOD

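Use an SSD-based single-shot detector as the candidate generator, producing pedestrian candidates at multiple scales and aspect ratios with high coverage and a high false-positive rate.
Train multiple deep verification networks (GoogLeNet, ResNet-50) independently on soft-labeled candidates, where the label is the IoU between the predicted box and the ground-truth box.
Implement a soft-rejection fusion network that combines the predictions of the verification networks and the candidate generator through learnable weights to adjust the confidence scores.
Integrate a deep context aggregation semantic segmentation network for pixel-level scene understanding; its output is softly fused with the detection confidence using a kernel-based method.
Learn the fusion network parameters end to end, so that the opinions of the different networks are weighted adaptively.
Trade off speed against accuracy by processing only candidates taller than 40 pixels and by fusing with SqueezeNet.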
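
The soft-rejection step in the list above can be illustrated with a small sketch. The summary does not give the fusion formula, so everything here is an assumption made for illustration: the function name `soft_rejection_fuse`, the multiplicative scaling rule, and the constants `accept_threshold=0.7` and `floor=0.1`. Fixed constants stand in for the learnable weights that the summary mentions.

```python
from math import prod


def soft_rejection_fuse(generator_score, verifier_probs,
                        accept_threshold=0.7, floor=0.1):
    """Rescale a candidate's score by each verifier's opinion (illustrative only).

    A verifier probability above `accept_threshold` boosts the score, a lower
    one shrinks it, and `floor` keeps any single verifier from driving the
    score all the way to zero. Both constants are hypothetical values.
    """
    scales = [max(p / accept_threshold, floor) for p in verifier_probs]
    return generator_score * prod(scales)


if __name__ == "__main__":
    # Two confident verifiers raise the generator's score.
    print(soft_rejection_fuse(0.5, [0.9, 0.8]))
    # One doubtful verifier lowers the score but does not discard the candidate.
    print(soft_rejection_fuse(0.5, [0.9, 0.05]))
```

The contrast with hard rejection is that a single low-confidence verifier only down-weights a candidate rather than deleting it, so the other networks can still keep it alive. Note that the fused value is a ranking score and may exceed 1.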

EXPERIMENTAL RESULTS

Research Questions

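RQ1: Can ensemble learning with multiple deep verification networks improve pedestrian detection accuracy while preserving real-time inference speed?
RQ2: Do IoU-based soft labels improve the performance of pedestrian verification networks compared with hard binary labels?
RQ3: To what extent does fusing semantic segmentation predictions with object detection improve robustness in challenging scenes such as occlusion or clutter?
RQ4: Can a learnable soft-rejection fusion mechanism combine the outputs of multiple deep networks better than simple averaging or voting?
RQ5: How do architectural choices (such as network type and fusion strategy) affect the trade-off between detection accuracy and inference speed?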

Key Findings

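The proposed F-DNN sets a new state-of-the-art log-average miss rate of 7.67% on the Caltech pedestrian detection dataset, improving on the 8.18% of prior work.
On the Caltech dataset, the system has the fastest inference among all state-of-the-art methods, taking only 0.09 seconds per image when fused with SqueezeNet.
The soft-label method significantly improves verification network performance through continuous IoU-based supervision, especially in ambiguous cases with partial overlap.
The fusion network learned to assign a higher weight to ResNet-50 (2.22) than to GoogLeNet (1.11), reflecting the predominance of non-occluded pedestrians in the training data.
Adding semantic segmentation improves detection robustness in crowded and occluded scenes, as confirmed by qualitative visualizations.
The system surpasses all previous state-of-the-art methods in both accuracy and speed on the Caltech, INRIA, and ETH datasets, and reaches comparable performance on the KITTI dataset.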
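
The findings above are reported as a log-average miss rate. As a reminder of what that number measures, here is a simplified sketch of the metric as it is commonly defined for the Caltech benchmark: the geometric mean of the miss rate sampled at nine false-positives-per-image (FPPI) values evenly spaced in log space between 0.01 and 1. That definition is not stated in the summary, and the function name, the input format, and the fallback of 1.0 for references the curve never reaches are assumptions of this sketch rather than the official evaluation code.

```python
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of the miss rate at log-spaced FPPI reference points.

    `fppi` and `miss_rate` are parallel lists describing points on a
    miss-rate-versus-FPPI curve.
    """
    # Reference FPPI values evenly spaced in log space over [1e-2, 1e0].
    refs = [10 ** (-2 + 2 * i / (num_points - 1)) for i in range(num_points)]
    sampled = []
    for ref in refs:
        # Lowest miss rate among curve points whose FPPI does not exceed the
        # reference; use 1.0 (everything missed) if no such point exists.
        eligible = [m for f, m in zip(fppi, miss_rate) if f <= ref]
        sampled.append(min(eligible) if eligible else 1.0)
    # Average in log space; clamp to avoid log(0) for a perfect detector.
    mean_log = sum(math.log(max(m, 1e-10)) for m in sampled) / len(sampled)
    return math.exp(mean_log)
```

Because the average is taken in log space, the metric rewards detectors that keep the miss rate low at small FPPI values, not only at the high-FPPI end of the curve.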

This review was generated by AI and reviewed by a human editor.