[Paper Review] DSSD : Deconvolutional Single Shot Detector
DSSD augments SSD with a Residual-101 backbone and deconvolution layers that inject large-scale context, reaching 81.5% mAP on VOC2007 test and 33.2% mAP on COCO and outperforming R-FCN on both.
The main contribution of this paper is an approach for introducing additional context into state-of-the-art general object detection. To achieve this we first combine a state-of-the-art classifier (Residual-101[14]) with a fast detection framework (SSD[18]). We then augment SSD+Residual-101 with deconvolution layers to introduce additional large-scale context in object detection and improve accuracy, especially for small objects, calling our resulting system DSSD for deconvolutional single shot detector. While these two contributions are easily described at a high-level, a naive implementation does not succeed. Instead we show that carefully adding additional stages of learned transformations, specifically a module for feed-forward connections in deconvolution and a new output module, enables this new approach and forms a potential way forward for further detection research. Results are shown on both PASCAL VOC and COCO detection. Our DSSD with $513 \times 513$ input achieves 81.5% mAP on VOC2007 test, 80.0% mAP on VOC2012 test, and 33.2% mAP on COCO, outperforming a state-of-the-art method R-FCN[3] on each dataset.
Motivation & Objective
- Motivate improving general object detection by injecting larger-scale contextual information.
- Investigate replacing VGG with a deeper backbone (Residual-101) in SSD for higher accuracy.
- Develop a deconvolution-based hourglass module to pass semantic context to later prediction layers.
- Introduce a prediction module and a deconvolution module to stabilize training and improve small-object detection.
Proposed method
- Replace VGG with Residual-101 as the base network in SSD to improve feature quality.
- Add a prediction module with residual blocks to enhance prediction layers and stabilize training.
- Attach deconvolution layers after SSD to form an asymmetric encoder-decoder (hourglass) network.
- Incorporate a deconvolution module with batch normalization and learned upsampling, combined via element-wise product for context fusion.
- Use skip connections to pass high-level context to finer-resolution feature maps, creating DSSD.
- Train in two stages: first freeze SSD and train deconvolution side, then fine-tune entire network; adopt SSD-like data augmentation and adjusted aspect ratios for default boxes.
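The fusion step of the deconvolution module can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: nearest-neighbour upsampling stands in for the learned deconvolution, and the convolution, batch-normalization, and activation layers are omitted. The names `upsample2x` and `fuse` and the toy feature maps are hypothetical. The sketch only shows how an upsampled deep (coarse) map is combined with a shallower (finer) map by element-wise product, with sum as the alternative the review compares against.

```python
from typing import List

Grid = List[List[float]]  # a single-channel feature map as nested lists


def upsample2x(feature: Grid) -> Grid:
    """Nearest-neighbour 2x upsampling.

    Assumption: this is a stand-in for the learned deconvolution layer,
    used only so the example runs without a deep-learning library.
    """
    out: Grid = []
    for row in feature:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))  # repeat each row twice
    return out


def fuse(deep: Grid, shallow: Grid, mode: str = "product") -> Grid:
    """Upsample the deep map and fuse it element-wise with the shallow map."""
    up = upsample2x(deep)
    if len(up) != len(shallow) or any(len(u) != len(s) for u, s in zip(up, shallow)):
        raise ValueError("shallow map must be exactly 2x the size of the deep map")
    if mode == "product":
        return [[a * b for a, b in zip(ru, rs)] for ru, rs in zip(up, shallow)]
    if mode == "sum":
        return [[a + b for a, b in zip(ru, rs)] for ru, rs in zip(up, shallow)]
    raise ValueError(f"unknown fusion mode: {mode}")


# Toy inputs (hypothetical values): a 1x2 deep map and a 2x4 shallow map.
deep = [[1.0, 0.0]]
shallow = [[2.0, 3.0, 4.0, 5.0],
           [6.0, 7.0, 8.0, 9.0]]

print(fuse(deep, shallow, "product"))  # [[2.0, 3.0, 0.0, 0.0], [6.0, 7.0, 0.0, 0.0]]
print(fuse(deep, shallow, "sum"))      # [[3.0, 4.0, 4.0, 5.0], [7.0, 8.0, 8.0, 9.0]]
```

In the toy output, the product lets the deep map act as a gate on the shallow map (positions where the deep map is zero are suppressed), whereas the sum keeps every shallow activation. This is only an intuition for why the two strategies can behave differently, not a result from the paper.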
Experimental results
Research questions
- RQ1: Can adding a deconvolution-based encoder-decoder (hourglass) structure to SSD improve accuracy, especially for small objects?
- RQ2: Does replacing VGG with Residual-101 and introducing a dedicated prediction module improve VOC/COCO detection performance without sacrificing speed?
- RQ3: What is the impact of different feature fusion strategies (sum vs product) in the deconvolution module on detection accuracy?
- RQ4: How does training strategy (two-stage training with frozen backbone followed by full fine-tuning) affect convergence and final performance?
Key findings
- DSSD with Residual-101 and deconvolution layers achieves higher accuracy on VOC and COCO than SSD and than the state-of-the-art method R-FCN.
- Prediction modules and deconvolution modules significantly improve mAP, especially for small objects and context-specific classes.
- Element-wise product fusion in the deconvolution module yields best accuracy among tested fusion methods.
- On VOC2007 test, DSSD with 513 × 513 input achieves 81.5% mAP, outperforming R-FCN and SSD variants.
- On VOC2012, DSSD achieves 80.0% mAP, and on COCO, DSSD 513 reaches 33.2% mAP, demonstrating strong cross-dataset performance.
This review was created by AI and reviewed by human editors.