QUICK REVIEW

[Paper Review] Crowd Counting by Adapting Convolutional Neural Networks with Side Information

Di Kang, Debarun Dhar | arXiv (Cornell University) | Nov 21, 2016

Video Surveillance and Tracking Methods | 19 references | 19 citations

TL;DR

This paper proposes Adaptive Convolutional Neural Networks (ACNN) that use side information, such as camera angle and height, to dynamically adjust convolutional filter weights, enabling context-aware feature learning. By modeling filter weights as a manifold parametrized by side information, ACNN improves crowd counting accuracy over standard CNNs and generalizes to unseen scene contexts without fine-tuning.

ABSTRACT

Computer vision tasks often have side information available that is helpful to solve the task. For example, for crowd counting, the camera perspective (e.g., camera angle and height) gives a clue about the appearance and scale of people in the scene. While side information has been shown to be useful for counting systems using traditional hand-crafted features, it has not been fully utilized in counting systems based on deep learning. In order to incorporate the available side information, we propose an adaptive convolutional neural network (ACNN), where the convolutional filter weights adapt to the current scene context via the side information. In particular, we model the filter weights as a low-dimensional manifold, parametrized by the side information, within the high-dimensional space of filter weights. With the help of side information and adaptive weights, the ACNN can disentangle the variations related to the side information, and extract discriminative features related to the current context. Since existing crowd counting datasets do not contain ground-truth side information, we collect a new dataset with the ground-truth camera angle and height as the side information. On experiments in crowd counting, the ACNN improves counting accuracy compared to a plain CNN with a similar number of parameters. We also apply ACNN to image deconvolution to show its potential effectiveness on other computer vision applications.

Motivation & Objective

To address the challenge of perspective distortion and appearance variation in crowd counting by explicitly modeling scene context using side information.
To overcome the limitation of standard CNNs that use fixed filters across all contexts, which entangle variations due to camera angle, height, and scale.
To develop a unified deep learning architecture that adapts to different scene contexts using auxiliary side information, enabling cross-scene deployment without fine-tuning.
To demonstrate the broader applicability of the ACNN framework beyond crowd counting, particularly in image deconvolution with variable blur kernels.
To collect a new dataset with ground-truth camera parameters to enable evaluation of context-aware counting in diverse real-world settings.

Proposed method

The ACNN architecture parameterizes convolutional filter weights as a low-dimensional manifold in the high-dimensional weight space, where the manifold is controlled by side information (e.g., camera tilt angle and height).
A sub-network generates the filter weights based on the side information, allowing the network to adapt its filters per scene context during inference.
The filter manifold is learned during training, enabling the network to disentangle context-related variations (e.g., perspective distortion) from content-related features.
The method uses a differentiable parameterization of filters, allowing end-to-end training with standard backpropagation.
For image deblurring, the auxiliary input is the blur kernel radius, and the ACNN learns a continuous filter manifold across different kernel sizes.
The architecture maintains a similar number of parameters to standard CNNs, ensuring efficiency while improving generalization.
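
The idea of a sub-network that turns side information into filter weights can be illustrated with a minimal sketch. This is not the authors' implementation: the two-layer tanh MLP, its sizes, the random initialisation, and all function names (make_fmn, generate_filter, conv2d_valid, adaptive_conv) are assumptions made for illustration, and training is omitted. It shows only the forward pass, where the same image is convolved with a different filter depending on a side-information vector such as (camera angle, camera height), here assumed to be normalised to [0, 1].

```python
import math
import random


def make_fmn(side_dim, hidden, k, seed=0):
    """Randomly initialise a tiny two-layer MLP that outputs k*k filter weights.

    Hypothetical stand-in for the weight-generating sub-network; a trained
    model would learn these parameters by backpropagation.
    """
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": mat(hidden, side_dim), "b1": [0.0] * hidden,
        "W2": mat(k * k, hidden), "b2": [0.0] * (k * k),
        "k": k,
    }


def generate_filter(fmn, side):
    """Map a side-information vector to a k x k filter (one point on the manifold)."""
    hidden = [math.tanh(sum(w * s for w, s in zip(row, side)) + b)
              for row, b in zip(fmn["W1"], fmn["b1"])]
    flat = [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(fmn["W2"], fmn["b2"])]
    k = fmn["k"]
    return [flat[i * k:(i + 1) * k] for i in range(k)]


def conv2d_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation over a list-of-lists image."""
    k = len(kernel)
    rows, cols = len(image), len(image[0])
    return [[sum(kernel[a][b] * image[i + a][j + b]
                 for a in range(k) for b in range(k))
             for j in range(cols - k + 1)]
            for i in range(rows - k + 1)]


def adaptive_conv(image, side, fmn):
    """Convolve with a filter generated from the current side information."""
    return conv2d_valid(image, generate_filter(fmn, side))


if __name__ == "__main__":
    fmn = make_fmn(side_dim=2, hidden=4, k=3)
    image = [[float(i + j) for j in range(5)] for i in range(5)]
    # Two hypothetical scene contexts: (normalised angle, normalised height).
    for side in ([0.3, 0.5], [0.8, 0.2]):
        print(side, adaptive_conv(image, side, fmn)[0])
```

Because every operation in generate_filter is differentiable, the gradient of a loss on the convolution output can flow back into the MLP parameters, which is what makes the end-to-end training described above possible. The learnable parameters live in the sub-network rather than in the filter itself, so the filter varies continuously as the side information changes.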

Experimental results

Research questions

RQ1: Can side information such as camera angle and height be effectively used to improve crowd counting accuracy in diverse scene contexts?
RQ2: Can an adaptive CNN architecture generalize to unseen scene contexts (e.g., new camera angles or heights) without fine-tuning?
RQ3: Does modeling filter weights as a manifold parametrized by side information lead to better feature disentanglement and improved performance compared to fixed filters?
RQ4: Can the ACNN framework be extended to other computer vision tasks, such as image deblurring, with variable auxiliary inputs?
RQ5: How does the ACNN perform in zero-shot generalization to unseen auxiliary inputs (e.g., blur kernel radii not seen during training) compared to standard CNNs?

Key findings

On the newly collected dataset with camera angle and height as side information, ACNN achieves higher crowd counting accuracy than a standard CNN with a similar number of parameters.
The ACNN generalizes effectively to cross-scene counting, achieving good performance on new camera angles and heights without any fine-tuning.
In image deblurring, the ACNN trained on multiple kernel radii (3, 5, 7, 9, 11) and tested on all of them achieved a +1.03 dB increase in PSNR over the blurred input, nearly double the improvement obtained by a standard CNN.
When trained on only three radii (3, 7, 11), the ACNN still achieved a +0.84 dB PSNR gain, demonstrating strong zero-shot generalization to unseen kernel sizes.
Visual results show that ACNN outputs have more details and less blurring than standard CNNs, which tend to over-smooth deblurred images.
Learned filter manifolds in the deblurring task show that both amplitude and frequency of filters adapt smoothly with the blur kernel radius, confirming the model’s ability to interpolate across the auxiliary input space.
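
The deblurring findings above are reported as PSNR gains in dB over the blurred input. The review does not say how PSNR was computed, so the following is a sketch under the standard definition, PSNR = 10 * log10(peak^2 / MSE), with pixel values assumed to lie in [0, 1] (peak = 1.0). The function names and the toy 2 x 2 images are hypothetical and are not data from the paper.

```python
import math


def mse(a, b):
    """Mean squared error between two equally sized list-of-lists images."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    err = mse(reference, estimate)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)


def psnr_gain(sharp, blurred, restored, peak=1.0):
    """PSNR of the restored image minus PSNR of the blurred input, in dB."""
    return psnr(sharp, restored, peak) - psnr(sharp, blurred, peak)


if __name__ == "__main__":
    # Toy images, made up purely to exercise the functions.
    sharp = [[0.0, 1.0], [1.0, 0.0]]
    blurred = [[0.5, 0.5], [0.5, 0.5]]
    restored = [[0.1, 0.9], [0.9, 0.1]]
    print(f"gain over blurred input: {psnr_gain(sharp, blurred, restored):.2f} dB")
```

Since PSNR is logarithmic in the mean squared error, a gain of about 1 dB corresponds to roughly a 20% reduction in MSE relative to the blurred input (10^(-0.1) is about 0.79).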


This review was created by AI and reviewed by human editors.