QUICK REVIEW

[Paper Review] Stacked U-Nets: A No-Frills Approach to Natural Image Segmentation

Sohil Shah, Pallabi Ghosh | arXiv (Cornell University) | Apr 27, 2018

Advanced Neural Network Applications | 44 references | 33 citations

TL;DR

This paper proposes Stacked U-Nets (SUNets), a lightweight, deep architecture that iteratively fuses multi-scale features across multiple U-Net blocks to preserve high-resolution spatial details while globalizing contextual information for natural image segmentation. SUNets achieve state-of-the-art performance on PASCAL VOC 2012 with 4.5% higher mIoU than ResNet-101, using ~7× fewer parameters, by replacing complex auxiliary modules with a deeper, parameter-efficient stacking of U-Net units.

ABSTRACT

Many imaging tasks require global information about all pixels in an image. Conventional bottom-up classification networks globalize information by decreasing resolution; features are pooled and downsampled into a single output. But for semantic segmentation and object detection tasks, a network must provide higher-resolution pixel-level outputs. To globalize information while preserving resolution, many researchers propose the inclusion of sophisticated auxiliary blocks, but these come at the cost of a considerable increase in network size and computational cost. This paper proposes stacked u-nets (SUNets), which iteratively combine features from different resolution scales while maintaining resolution. SUNets leverage the information globalization power of u-nets in a deeper network architecture that is capable of handling the complexity of natural images. SUNets perform extremely well on semantic segmentation tasks using a small number of parameters.

Motivation & Objective

To address the challenge of preserving high-resolution spatial details while capturing long-range contextual information in natural image segmentation.
To reduce the computational and parameter burden of existing segmentation models that rely on complex auxiliary context modules or deep classification backbones.
To improve performance on semantic segmentation tasks without increasing model size or inference cost.
To explore whether stacking U-Net blocks can yield better feature representation than single U-Net or deep classification networks with auxiliary heads.

Proposed method

Stacked U-Nets (SUNets) are constructed by stacking multiple U-Net blocks in a deep architecture, enabling iterative fusion of features across different resolution levels.
Each U-Net block performs encoding (downsampling with strided convolutions) and decoding (upsampling with deconvolutions) to preserve spatial resolution while integrating contextual information.
The architecture avoids dilated convolutions and multigrid strategies, instead using strided convolutions followed by de-gridding layers to reduce gridding artifacts.
Feature maps from skip connections between encoder and decoder paths are concatenated at each level to preserve spatial detail and enrich representation.
The network is trained with a standard cross-entropy loss, and multi-scale inference is used at test time to improve robustness.
A variant, SUNet-7-128, uses 7 stacked U-Net blocks and 128 filters per layer, achieving high performance with low parameter count.
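
The encode-then-decode pattern described above can be illustrated with a small shape-tracking sketch. It only follows the spatial size of a feature map through strided convolutions and transposed convolutions (deconvolutions), and checks that each skip connection meets a decoder output of the same size. The kernel sizes, paddings, the depth of two levels per block, and the function names are illustrative assumptions, not values taken from the paper.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of a strided convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel=4, stride=2, padding=1):
    """Output length of a transposed convolution along one spatial axis."""
    return (size - 1) * stride - 2 * padding + kernel


def unet_block_trace(size, depth=2):
    """Trace spatial sizes through one encoder-decoder block.

    depth=2 is an assumed number of down/up levels, chosen for illustration.
    """
    trace = [size]
    skips = []
    for _ in range(depth):              # encoder: halve the resolution
        skips.append(size)
        size = conv_out(size)
        trace.append(size)
    for _ in range(depth):              # decoder: double the resolution
        size = deconv_out(size)
        if size != skips.pop():         # concatenation needs equal sizes
            raise ValueError("skip connection and decoder sizes differ")
        trace.append(size)
    return trace


def stacked_sizes(size, num_blocks=3, depth=2):
    """Output size after each block when blocks are chained end to end."""
    sizes = []
    for _ in range(num_blocks):
        size = unet_block_trace(size, depth)[-1]
        sizes.append(size)
    return sizes


print(unet_block_trace(64))   # [64, 32, 16, 32, 64]
print(stacked_sizes(64))      # [64, 64, 64]
```

Because every block returns a feature map at its input resolution, blocks can be chained without any bookkeeping between them, which is the property that makes stacking straightforward.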

Experimental results

Research questions

RQ1: Can a deeper architecture composed of stacked U-Net blocks outperform standard U-Net and ResNet-based models in natural image semantic segmentation?
RQ2: Does eliminating complex auxiliary context modules (e.g., ASPP, CRF) while maintaining high-resolution output lead to better efficiency and performance?
RQ3: To what extent can a lightweight, parameter-efficient architecture achieve state-of-the-art mIoU on PASCAL VOC 2012 without relying on heavy pre-trained backbones?
RQ4: How does the stacking of U-Net blocks affect feature representation and generalization compared to single U-Net or deep classification networks?

Key findings

SUNet-7-128 achieves 84.3% mIoU on the Cityscapes test set, outperforming several state-of-the-art models including RefineNet-ResNet152 and DeepLabv2+CRF.
On PASCAL VOC 2012, SUNet-7-128 achieves 83.34% mIoU on the test set, exceeding the performance of ResNet-101 by 4.5% mIoU while using ~7× fewer parameters.
The model achieves strong performance with only 2.5M parameters, a far smaller parameter count than PSPNet (30M more parameters) and other models built on auxiliary modules.
Qualitative results show that SUNets produce coherent, sharp segmentation maps with reduced gridding artifacts, especially when de-gridding layers are used.
The architecture generalizes well to diverse natural image distributions, as evidenced by strong performance on both PASCAL VOC 2012 and Cityscapes benchmarks.
The ablation study confirms that strided convolutions with de-gridding layers outperform dilated convolutions in terms of feature map coherence and segmentation quality.
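
The findings above are reported in mIoU (mean intersection over union). As a reminder of how that metric is commonly computed, here is a minimal sketch that derives it from a confusion matrix. The function name and the toy two-class matrix are made up for illustration, and the choice to skip classes that appear in neither prediction nor ground truth is an assumption; benchmark toolkits may handle that case differently.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[i][j] is the number of pixels whose true class is i
    and whose predicted class is j.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                                   # true c, predicted other
        fp = sum(confusion[r][c] for r in range(num_classes)) - tp    # predicted c, true other
        denom = tp + fp + fn
        if denom > 0:          # assumption: ignore classes that never occur
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with two classes (illustrative numbers only).
confusion = [
    [50, 10],
    [5, 35],
]
print(round(mean_iou(confusion), 3))   # 0.735
```

In the toy matrix, class 0 has IoU 50 / 65 and class 1 has IoU 35 / 50, and the metric is their unweighted mean, so a small class counts as much as a large one.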

This review was created by AI and reviewed by human editors.