QUICK REVIEW

[Paper Review] CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes

Yuhong Li, Xiaofan Zhang | arXiv (Cornell University) | Feb 27, 2018

Video Surveillance and Tracking Methods | 32 references | 168 citations

TL;DR

CSRNet is a deep, end-to-end CNN that combines a VGG-16 front-end with a dilated-convolution back-end to generate high-quality crowd density maps and accurate counts in congested scenes, outperforming prior state-of-the-art methods.

ABSTRACT

We propose a network for Congested Scene Recognition called CSRNet to provide a data-driven and deep learning method that can understand highly congested scenes and perform accurate count estimation as well as present high-quality density maps. The proposed CSRNet is composed of two major components: a convolutional neural network (CNN) as the front-end for 2D feature extraction and a dilated CNN for the back-end, which uses dilated kernels to deliver larger reception fields and to replace pooling operations. CSRNet is an easy-trained model because of its pure convolutional structure. We demonstrate CSRNet on four datasets (ShanghaiTech dataset, the UCF_CC_50 dataset, the WorldEXPO'10 dataset, and the UCSD dataset) and we deliver the state-of-the-art performance. In the ShanghaiTech Part_B dataset, CSRNet achieves 47.3% lower Mean Absolute Error (MAE) than the previous state-of-the-art method. We extend the targeted applications for counting other objects, such as the vehicle in TRANCOS dataset. Results show that CSRNet significantly improves the output quality with 15.4% lower MAE than the previous state-of-the-art approach.

Motivation & Objective

Enable accurate crowd counting and density map generation in highly congested scenes.
Develop a data-driven, end-to-end CNN that preserves resolution while expanding receptive fields.
Improve over multi-column CNN architectures by using a deeper, single-column model with dilated convolutions.

Proposed method

Use the first 10 convolutional layers of VGG-16 as the front-end for 2D feature extraction.
Replace pooling with dilated convolutions in the back-end to enlarge receptive fields without reducing resolution.
Train end-to-end with Euclidean loss between predicted and ground-truth density maps.
Generate ground-truth density maps using geometry-adaptive Gaussian kernels.
Apply data augmentation during training, and obtain the count estimate by summing the predicted density map.
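
The dilation and loss steps above can be illustrated with a small sketch. It is not the authors' implementation: it uses plain Python lists instead of a deep learning framework, and the function names (dilated_conv2d, effective_kernel_size, euclidean_loss, count_from_density) are hypothetical names chosen for this example. It assumes a square, odd-sized kernel, zero padding, and a loss of the form 1/(2N) times the summed squared pixel difference over N density maps.

```python
def dilated_conv2d(image, kernel, rate):
    """Zero-padded 2D dilated convolution (cross-correlation form).

    The kernel taps are spaced `rate` pixels apart, so the output keeps
    the input's height and width while each output pixel sees a wider area.
    """
    h, w = len(image), len(image[0])
    k = len(kernel)            # assumption: square k x k kernel with odd k
    pad = rate * (k // 2)      # padding that preserves the H x W resolution
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy = y + i * rate - pad
                    xx = x + j * rate - pad
                    if 0 <= yy < h and 0 <= xx < w:   # outside = zero padding
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def effective_kernel_size(k, rate):
    """Side length covered by a k x k kernel with dilation rate `rate`."""
    return k + (k - 1) * (rate - 1)


def euclidean_loss(pred_maps, gt_maps):
    """1/(2N) * sum of squared pixel differences over N density maps."""
    total = 0.0
    for pred, gt in zip(pred_maps, gt_maps):
        for pred_row, gt_row in zip(pred, gt):
            total += sum((p - g) ** 2 for p, g in zip(pred_row, gt_row))
    return total / (2 * len(gt_maps))


def count_from_density(density):
    """The estimated count is the sum (integral) of the density map."""
    return sum(sum(row) for row in density)


# A 7 x 7 map with a single unit impulse at the center.
image = [[0.0] * 7 for _ in range(7)]
image[3][3] = 1.0
ones = [[1.0] * 3 for _ in range(3)]

out = dilated_conv2d(image, ones, rate=2)
print(len(out), len(out[0]))           # resolution is unchanged: 7 7
print(effective_kernel_size(3, 2))     # a 3 x 3 kernel at rate 2 spans 5
```

With rate=2 the 3 x 3 kernel covers a 5 x 5 area using only nine weights, which is the property the back-end relies on to enlarge receptive fields without further pooling.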

Experimental results

Research questions

RQ1: Can a deeper single-column CNN with dilated convolutions outperform multi-column architectures in dense crowd counting?
RQ2: Does preserving spatial resolution via dilation improve density map quality and counting accuracy across benchmarks?
RQ3: How do CSRNet’s density maps compare to ground-truth density maps in terms of PSNR/SSIM across datasets?

Key findings

CSRNet achieves state-of-the-art MAE/MSE on ShanghaiTech Part_A (68.2/115.0) and Part_B (10.6/16.0) compared to prior methods.
On UCF_CC_50, CSRNet attains MAE 266.1 and MSE 397.5, outperforming several baselines.
CSRNet yields the best average performance on WorldExpo’10 across five scenes (average MAE 8.6).
On UCSD, CSRNet reports MAE 1.16 and MSE 1.47, competitive with MCNN.
For TRANCOS vehicle counting, CSRNet achieves GAME(0)=3.56, GAME(1)=5.49, GAME(2)=8.57, GAME(3)=15.04, showing robust generalization.
CSRNet produces higher-quality density maps, with PSNR 23.79 and SSIM 0.76 on ShanghaiTech Part_A, outperforming MCNN and CP-CNN.
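
The counting metrics quoted above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code. It assumes the convention common in crowd counting, where "MSE" is the square root of the mean squared count error, and it assumes GAME(L) splits each image into a 2^L x 2^L grid of non-overlapping cells and sums the absolute count error per cell. The function names and toy inputs are hypothetical.

```python
import math


def mae(pred_counts, gt_counts):
    """Mean absolute error between predicted and true per-image counts."""
    errors = [abs(p - g) for p, g in zip(pred_counts, gt_counts)]
    return sum(errors) / len(errors)


def mse(pred_counts, gt_counts):
    """Root of the mean squared count error (the 'MSE' assumed here)."""
    squared = [(p - g) ** 2 for p, g in zip(pred_counts, gt_counts)]
    return math.sqrt(sum(squared) / len(squared))


def region_sum(density, y0, y1, x0, x1):
    """Count inside the cell covering rows y0..y1-1 and columns x0..x1-1."""
    return sum(sum(row[x0:x1]) for row in density[y0:y1])


def game(pred_maps, gt_maps, level):
    """Grid Average Mean absolute Error over 4**level cells per image."""
    cells = 2 ** level
    total = 0.0
    for pred, gt in zip(pred_maps, gt_maps):
        h, w = len(gt), len(gt[0])
        for cy in range(cells):
            y0, y1 = cy * h // cells, (cy + 1) * h // cells
            for cx in range(cells):
                x0, x1 = cx * w // cells, (cx + 1) * w // cells
                total += abs(region_sum(pred, y0, y1, x0, x1)
                             - region_sum(gt, y0, y1, x0, x1))
    return total / len(gt_maps)


# Toy data: the total count matches, but the objects are misplaced.
gt_map = [[1.0, 0.0], [0.0, 1.0]]
pred_map = [[0.0, 1.0], [1.0, 0.0]]
print(game([pred_map], [gt_map], level=0))   # 0.0, totals agree
print(game([pred_map], [gt_map], level=1))   # 4.0, every cell is off by one
print(mae([10, 20], [12, 16]))               # 3.0
```

GAME(0) reduces to MAE on whole-image counts, while higher levels also penalize density placed in the wrong part of the image, which is why the TRANCOS values grow from GAME(0) to GAME(3).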

This review was created by AI and reviewed by human editors.