QUICK REVIEW

[Paper Review] DSSD : Deconvolutional Single Shot Detector

Cheng-Yang Fu, Wei Liu|arXiv (Cornell University)|Jan 23, 2017

Advanced Neural Network Applications|3 references|1,636 citations

TL;DR

DSSD augments SSD, built on a Residual-101 backbone, with deconvolution layers that add large-scale context, achieving 81.5% mAP on VOC2007 test and 33.2% mAP on COCO and outperforming R-FCN on both.

ABSTRACT

The main contribution of this paper is an approach for introducing additional context into state-of-the-art general object detection. To achieve this we first combine a state-of-the-art classifier (Residual-101 [14]) with a fast detection framework (SSD [18]). We then augment SSD+Residual-101 with deconvolution layers to introduce additional large-scale context in object detection and improve accuracy, especially for small objects, calling our resulting system DSSD for deconvolutional single shot detector. While these two contributions are easily described at a high-level, a naive implementation does not succeed. Instead we show that carefully adding additional stages of learned transformations, specifically a module for feed-forward connections in deconvolution and a new output module, enables this new approach and forms a potential way forward for further detection research. Results are shown on both PASCAL VOC and COCO detection. Our DSSD with $513 \times 513$ input achieves 81.5% mAP on VOC2007 test, 80.0% mAP on VOC2012 test, and 33.2% mAP on COCO, outperforming a state-of-the-art method R-FCN [3] on each dataset.

Motivation & Objective

Improve general object detection by injecting larger-scale contextual information.
Investigate replacing VGG with a deeper backbone (Residual-101) in SSD for higher accuracy.
Develop a deconvolution-based hourglass module to pass semantic context to later prediction layers.
Introduce a prediction module and a deconvolution module to stabilize training and improve small-object detection.

Proposed method

Replace VGG with Residual-101 as the base network in SSD to improve feature quality.
Add a prediction module with residual blocks to enhance prediction layers and stabilize training.
Attach deconvolution layers after SSD to form an asymmetric encoder-decoder (hourglass) network.
Incorporate a deconvolution module with batch normalization and learned upsampling, combined via element-wise product for context fusion.
Use skip connections to pass high-level context to finer-resolution feature maps, creating DSSD.
Train in two stages: first freeze the SSD weights and train only the deconvolution side, then fine-tune the entire network. Adopt SSD-like data augmentation and adjusted aspect ratios for the default boxes.
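The fusion step in the deconvolution module can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: nearest-neighbour upsampling stands in for the learned deconvolution, the convolution and batch normalization layers are omitted, and the names upsample2x and fuse are hypothetical. It only shows how a coarse, high-level map is brought to the resolution of a finer map and combined with it by element-wise product (or by sum, for comparison).

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D grid.

    Assumption: a stand-in for the learned deconvolution used in DSSD.
    """
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def fuse(deep, shallow, mode="product"):
    """Upsample the coarse `deep` map and combine it with `shallow`."""
    up = upsample2x(deep)
    if len(up) != len(shallow) or len(up[0]) != len(shallow[0]):
        raise ValueError("shapes must match after upsampling")
    if mode == "product":
        combine = lambda a, b: a * b
    elif mode == "sum":
        combine = lambda a, b: a + b
    else:
        raise ValueError("mode must be 'product' or 'sum'")
    return [[combine(a, b) for a, b in zip(r_up, r_sh)]
            for r_up, r_sh in zip(up, shallow)]


# Toy inputs: a 2x2 high-level map and a 4x4 finer-resolution map.
deep = [[1.0, 2.0], [3.0, 4.0]]
shallow = [[0.5] * 4 for _ in range(4)]
fused = fuse(deep, shallow, mode="product")
```

With the product mode, the high-level map acts as a multiplicative gate on the finer features; switching mode to "sum" gives the additive alternative that the review lists as a comparison point.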

Experimental results

Research questions

RQ1: Can adding a deconvolution-based encoder-decoder (hourglass) structure to SSD improve accuracy, especially for small objects?
RQ2: Does replacing VGG with Residual-101 and introducing a dedicated prediction module improve VOC/COCO detection performance without sacrificing speed?
RQ3: What is the impact of different feature fusion strategies (sum vs product) in the deconvolution module on detection accuracy?
RQ4: How does the training strategy (two stages: SSD weights frozen while the deconvolution side is trained, followed by full fine-tuning) affect convergence and final performance?
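The two-stage schedule that RQ4 refers to can be sketched as a small parameter-freezing helper. This is an illustration under assumptions, not the paper's training code: the group names "ssd" and "deconv_side" and the function configure_stage are hypothetical, and a real framework would toggle gradient flags on actual weights instead of a boolean field.

```python
from dataclasses import dataclass


@dataclass
class ParamGroup:
    name: str            # hypothetical label for a set of weights
    trainable: bool = True


def configure_stage(groups, stage):
    """Set which groups are updated in each training stage.

    Stage 1: the SSD side is frozen, only the deconvolution side trains.
    Stage 2: everything is unfrozen for full fine-tuning.
    Returns the names of the groups that will be updated.
    """
    if stage not in (1, 2):
        raise ValueError("stage must be 1 or 2")
    for group in groups:
        group.trainable = (stage == 2) or (group.name != "ssd")
    return [g.name for g in groups if g.trainable]


groups = [ParamGroup("ssd"), ParamGroup("deconv_side")]
stage1 = configure_stage(groups, 1)   # only the deconvolution side
stage2 = configure_stage(groups, 2)   # the entire network
```

Keeping the schedule in one function makes it easy to compare against a single-stage baseline by calling only the stage 2 configuration from the start.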

Key findings

DSSD with Residual-101 and deconvolution layers achieves higher accuracy than SSD and than competing state-of-the-art methods such as R-FCN on VOC and COCO.
Prediction modules and deconvolution modules significantly improve mAP, especially for small objects and context-specific classes.
Element-wise product fusion in the deconvolution module yields best accuracy among tested fusion methods.
On VOC2007 test, DSSD with 513 × 513 input achieves 81.5% mAP, outperforming prior single-network detectors like R-FCN and SSD variants.
On VOC2012 test, DSSD achieves 80.0% mAP, and on COCO, DSSD with 513 × 513 input reaches 33.2% mAP, demonstrating strong cross-dataset performance.

This review was created by AI and reviewed by human editors.