QUICK REVIEW

[Paper Review] Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition

Ionuţ Cosmin Duţă, Li Liu | arXiv (Cornell University) | Jun 20, 2020

Advanced Neural Network Applications | 51 references | 138 citations

TL;DR

PyConv creates a multi-scale pyramid of kernels that processes input at varied spatial sizes and depths without increasing parameter count, improving performance across classification, segmentation, and related tasks.

ABSTRACT

This work introduces pyramidal convolution (PyConv), which is capable of processing the input at multiple filter scales. PyConv contains a pyramid of kernels, where each level involves different types of filters with varying size and depth, which are able to capture different levels of details in the scene. On top of these improved recognition capabilities, PyConv is also efficient and, with our formulation, it does not increase the computational cost and parameters compared to standard convolution. Moreover, it is very flexible and extensible, providing a large space of potential network architectures for different applications. PyConv has the potential to impact nearly every computer vision task and, in this work, we present different architectures based on PyConv for four main tasks on visual recognition: image classification, video action classification/recognition, object detection and semantic image segmentation/parsing. Our approach shows significant improvements over all these core tasks in comparison with the baselines. For instance, on image recognition, our 50-layers network outperforms in terms of recognition performance on ImageNet dataset its counterpart baseline ResNet with 152 layers, while having 2.39 times less parameters, 2.52 times lower computational complexity and more than 3 times less layers. On image segmentation, our novel framework sets a new state-of-the-art on the challenging ADE20K benchmark for scene parsing. Code is available at: https://github.com/iduta/pyconv

Motivation & Objective

Address limitations of fixed-size kernels and limited receptive fields in standard CNNs.
Develop a multi-scale, multi-depth convolution operator (PyConv) that preserves parameter efficiency.
Demonstrate PyConv effectiveness across image classification, video action recognition, object detection, and semantic segmentation.
Provide architectures (PyConvResNet, PyConvHGResNet, PyConvSegNet) that outperform baselines on key visual recognition benchmarks.

Proposed method

Define PyConv as a pyramid of kernels with increasing spatial size and decreasing depth across levels.
Implement PyConv with grouped convolution to control kernel depth per level and maintain parameter parity with standard conv.
Embed PyConv into residual bottleneck blocks to form PyConvResNet and PyConvHGResNet architectures.
Propose PyConvPH (LocalPyConv, GlobalPyConv, Merge blocks) for semantic segmentation to capture local and global multi-scale context.
Compare performance against ResNet baselines on ImageNet and ADE20K, and analyze parameter/FLOP budgets.
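
The parameter-parity idea in the list above can be checked with simple arithmetic. The following is a minimal sketch, not the authors' implementation: the helper names `conv_params` and `pyconv_params` are hypothetical, and the configuration (64 input channels, four levels of 16 output channels each, kernel sizes 3/5/7/9 with groups 1/4/8/16) is an assumption chosen for illustration. It counts the weights of a grouped convolution as kernel² × (input channels / groups) × output channels, with bias omitted.

```python
def conv_params(c_in, c_out, kernel, groups=1):
    """Weight count of a 2D convolution (bias omitted).

    With grouped convolution each filter sees only c_in / groups
    input channels, which is how kernel depth is reduced per level.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    return kernel * kernel * (c_in // groups) * c_out


def pyconv_params(c_in, levels):
    """Total weights of a kernel pyramid.

    `levels` is a list of (c_out, kernel, groups) tuples, one per level.
    """
    return sum(conv_params(c_in, c_out, k, g) for c_out, k, g in levels)


# Assumed illustrative configuration (not taken from this review):
# larger kernels are paired with more groups, i.e. shallower filters.
C_IN = 64
LEVELS = [(16, 3, 1), (16, 5, 4), (16, 7, 8), (16, 9, 16)]

standard = conv_params(C_IN, 64, 3)    # single 3x3 conv, 64 -> 64 channels
pyramid = pyconv_params(C_IN, LEVELS)  # four levels, 4 * 16 = 64 outputs

print(standard, pyramid)  # 36864 27072
```

Under these assumed settings the pyramid uses fewer weights than a single 3x3 convolution with the same input and output widths, because each larger kernel is applied to only a fraction of the input channels. At a fixed output resolution the multiply-accumulate count scales with the weight count, so the same comparison applies to compute.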

Experimental results

Research questions

RQ1: Can PyConv improve recognition performance while keeping similar parameter count and computational cost as standard convolution?
RQ2: Does multi-scale, multi-depth kernel processing benefit diverse vision tasks (classification, segmentation, detection, video) when integrated into CNN backbones?
RQ3: How should kernel sizes, depths, and grouping be configured across network stages for optimal accuracy and efficiency?
RQ4: Is a multi-scale segmentation head (PyConvPH) able to outperform existing segmentation heads on ADE20K?

Key findings

PyConv-based networks outperform ResNet baselines on ImageNet while using fewer parameters and FLOPs (e.g., PyConvResNet-50: top-1 error 22.12%, 24.85M parameters, 3.88 GFLOPs).
PyConvHGResNet-50 reaches a still lower single-model top-1 error (21.52%).
PyConv enables effective downsampling via multi-scale kernels, improving translation invariance without extra cost.
The PyConvSegNet framework with the PyConvPH head sets a new state of the art on the ADE20K scene parsing benchmark, according to the abstract.
Across depths, PyConv variants converge faster during training and yield better validation accuracy than ResNet counterparts.
The results indicate that combining several kernel sizes within a layer (e.g., 9x9, 7x7, 5x5, 3x3), each with appropriate grouping, yields consistent performance gains without inflating parameters.

This review was created by AI and reviewed by human editors.