QUICK REVIEW

[Paper Review] Crowd Counting by Adaptively Fusing Predictions from an Image Pyramid

Di Kang, Antoni B. Chan | arXiv (Cornell University) | May 16, 2018

Video Surveillance and Tracking Methods | References: 18 | Cited by: 94

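One-Sentence Summary

This paper proposes an image-pyramid-based crowd counting method that adaptively fuses density predictions from multiple scales, using an across-scale attention map followed by a 1x1 convolution, and combines high accuracy with fast inference (real time or faster).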

ABSTRACT

Because of the powerful learning capability of deep neural networks, counting performance via density map estimation has improved significantly during the past several years. However, it is still very challenging due to severe occlusion, large scale variations, and perspective distortion. Scale variations (from image to image) coupled with perspective distortion (within one image) result in huge scale changes of the object size. Earlier methods based on convolutional neural networks (CNN) typically did not handle this scale variation explicitly, until Hydra-CNN and MCNN. MCNN uses three columns, each with different filter sizes, to extract features at different scales. In this paper, in contrast to using filters of different sizes, we utilize an image pyramid to deal with scale variations. It is more effective and efficient to resize the input fed into the network, as compared to using larger filter sizes. Secondly, we adaptively fuse the predictions from different scales (using adaptively changing per-pixel weights), which makes our method adapt to scale changes within an image. The adaptive fusing is achieved by generating an across-scale attention map, which softly selects a suitable scale for each pixel, followed by a 1x1 convolution. Extensive experiments on three popular datasets show very compelling results.
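
The image pyramid mentioned in the abstract can be illustrated with a small sketch. This is a pure-Python toy, not the authors' code: the function names `downsample_2x` and `build_pyramid` are hypothetical, 2x2 average pooling stands in for proper image resizing, and the factor of 2 between levels is an assumption, since this review does not state the scale ratios used in the paper.

```python
def downsample_2x(image):
    """Halve height and width by averaging each 2x2 block.

    `image` is a list of rows of floats. Average pooling is only a
    simple stand-in for a real image resizing routine.
    """
    height, width = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * i][2 * j] + image[2 * i][2 * j + 1]
             + image[2 * i + 1][2 * j] + image[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(width)
        ]
        for i in range(height)
    ]


def build_pyramid(image, num_scales=3):
    """Return [full-size image, half-size image, quarter-size image, ...].

    The 2x step between levels is an assumption made for illustration.
    """
    pyramid = [image]
    for _ in range(num_scales - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid
```

Each pyramid level would then be passed through the same network, so that an object too large for the receptive field at full resolution appears at a more suitable size in a coarser level.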

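Motivation and Objectives

Address the large within-image scale variation and perspective distortion that arise in crowd counting.
Propose an image-pyramid-based FCN backbone that produces scale-specific density maps.
Develop an adaptive fusion mechanism with across-scale attention that selects a suitable scale for each pixel.
Demonstrate state-of-the-art or competitive performance at real-time or faster-than-real-time inference speeds.
Evaluate on the ShanghaiTech, WorldExpo, and UCSD datasets to validate effectiveness.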

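Proposed Method

Build an image pyramid by downsampling the input image to multiple scales.
Process each scale with a shared backbone FCN to produce a density map per scale.
Generate across-scale attention maps from the last-layer feature maps of the scale-specific branches.
Apply a softmax across scales to obtain per-pixel scale weights, and multiply each weight map with the corresponding density map.
Fuse the rectified (attention-weighted) density maps from all scales with a 1x1 convolution to obtain the final density map.
Train end to end with a per-pixel mean squared error loss on 32x32 density patches (from 128x128 inputs).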
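
The fusion and loss steps above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: it assumes the per-scale density maps and attention logits have already been resized to a common resolution, and the names `softmax_across_scales`, `fuse_density_maps`, and `pixelwise_mse`, along with the example 1x1 convolution weights, are hypothetical.

```python
import math


def softmax_across_scales(logits):
    """logits: list of S maps (each H x W). Returns S weight maps that
    sum to 1 at every pixel."""
    num_scales = len(logits)
    height, width = len(logits[0]), len(logits[0][0])
    weights = [[[0.0] * width for _ in range(height)] for _ in range(num_scales)]
    for i in range(height):
        for j in range(width):
            column = [logits[s][i][j] for s in range(num_scales)]
            peak = max(column)  # subtract the max for numerical stability
            exps = [math.exp(v - peak) for v in column]
            total = sum(exps)
            for s in range(num_scales):
                weights[s][i][j] = exps[s] / total
    return weights


def fuse_density_maps(densities, logits, conv_weights, conv_bias=0.0):
    """Weight each scale's density map by its attention map, then mix the
    S weighted maps with a 1x1 convolution (one scalar weight per scale)."""
    weights = softmax_across_scales(logits)
    num_scales = len(densities)
    height, width = len(densities[0]), len(densities[0][0])
    return [
        [
            conv_bias + sum(
                conv_weights[s] * weights[s][i][j] * densities[s][i][j]
                for s in range(num_scales)
            )
            for j in range(width)
        ]
        for i in range(height)
    ]


def pixelwise_mse(predicted, target):
    """Mean of squared per-pixel differences between two H x W maps."""
    squared = [
        (p - t) ** 2
        for pred_row, target_row in zip(predicted, target)
        for p, t in zip(pred_row, target_row)
    ]
    return sum(squared) / len(squared)
```

With equal logits, every scale contributes equally at each pixel; as one scale's logit grows, the fused value approaches that scale's prediction, which is the "soft selection" behaviour described in the abstract.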

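Experimental Results

Research Questions

RQ1: Does an image pyramid with adaptive per-pixel scale fusion improve crowd counting when object size and perspective vary within a single image?
RQ2: Does attention-based fusion outperform fixed or simple strategies for fusing multi-scale density maps?
RQ3: How does the proposed FCN backbone with limited downsampling affect density map quality and runtime?
RQ4: How does performance compare with existing multi-scale counting methods on standard datasets?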

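Key Findings

Adaptive image pyramid fusion outperforms the single-scale FCN baseline on ShanghaiTech Part A/B, WorldExpo, and UCSD.
FCN-7c-3s (3-scale fusion) achieves MAE 80.6 and RMSE 126.7 on ShanghaiTech Part A, and MAE 10.2 and RMSE 18.3 on Part B.
The 2-scale variant (FCN-7c-2s) achieves MAE 81.3 and RMSE 132.6 on Part A, and MAE 10.9 and RMSE 19.1 on Part B.
Compared with CNN-patch, MCNN, Switch-CNN, and CP-CNN, FCN-7c-3s is competitive in MAE/MSE while running faster than real time on high-resolution images.
Attention-based fusion (softmax across scales) is essential; removing the softmax or using fixed fusion degrades performance.
Depending on the configuration, the method runs at 158-439 fps, offering a favorable speed-accuracy trade-off.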
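
For readers unfamiliar with the metrics quoted above, the sketch below shows how MAE and RMSE are commonly computed in crowd counting, where the predicted count of an image is the sum of its density map. The function names `count_from_density` and `mae_and_rmse` and the numbers in the usage comment are made up for illustration; they are not results from the paper.

```python
import math


def count_from_density(density_map):
    """The estimated count is the sum of all density map values."""
    return sum(sum(row) for row in density_map)


def mae_and_rmse(predicted_counts, true_counts):
    """Mean absolute error and root mean squared error over per-image counts."""
    errors = [p - t for p, t in zip(predicted_counts, true_counts)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mae, rmse


# Hypothetical usage with two test images:
# mae, rmse = mae_and_rmse([10.0, 20.0], [12.0, 16.0])
```

MAE reflects the typical counting error per image, while RMSE penalizes large errors on individual images more heavily, so the two can rank methods differently.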

This review was generated by AI and reviewed by human editors.