QUICK REVIEW

[Paper Review] Rethinking Classification and Localization for Object Detection

Yue Wu, Yinpeng Chen | arXiv (Cornell University) | Apr 13, 2019

Advanced Neural Network Applications | 47 references | 40 citations

TL;DR

The paper analyzes how fully connected and convolutional detection heads differently affect classification and localization, and introduces a Double-Head detector that combines fc-head for classification with conv-head for bounding box regression, yielding notable AP gains on COCO.

ABSTRACT

Two head structures (i.e. fully connected head and convolution head) have been widely used in R-CNN based detectors for classification and localization tasks. However, there is a lack of understanding of how these two head structures work for these two tasks. To address this issue, we perform a thorough analysis and find an interesting fact that the two head structures have opposite preferences towards the two tasks. Specifically, the fully connected head (fc-head) is more suitable for the classification task, while the convolution head (conv-head) is more suitable for the localization task. Furthermore, we examine the output feature maps of both heads and find that fc-head has more spatial sensitivity than conv-head. Thus, fc-head has more capability to distinguish a complete object from part of an object, but is not robust to regress the whole object. Based upon these findings, we propose a Double-Head method, which has a fully connected head focusing on classification and a convolution head for bounding box regression. Without bells and whistles, our method gains +3.5 and +2.8 AP on MS COCO dataset from Feature Pyramid Network (FPN) baselines with ResNet-50 and ResNet-101 backbones, respectively.

Motivation & Objective

Understand how fc-head and conv-head influence classification and localization in two-stage detectors.
Empirically compare fc-head and conv-head using predefined proposals on MS COCO 2017 validation.
Identify complementary strengths and weaknesses of the two heads.
Propose a joint architecture (Double-Head) that leverages both heads for improved detection.
Explore extensions leveraging unfocused tasks to further boost accuracy.

Proposed method

Train and compare fc-head and conv-head on FPN with ResNet-50 to assess classification vs localization performance.
Analyze output feature maps to measure spatial sensitivity and correlation with IoU.
Propose Double-Head architecture: fc-head for classification and conv-head for bbox regression.
Extend to Double-Head-Ext by incorporating unfocused task supervision and classifier fusion during inference.
Evaluate on COCO and VOC07 with ablations on backbones and head configurations.
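
The Double-Head-Ext step in the list above (supervising the unfocused task in each head, then fusing the two classifiers at inference) can be sketched in plain Python. This is a minimal sketch under assumptions, not the authors' implementation: the function names, the weights `lam_fc`, `lam_conv`, `w_fc`, `w_conv` and their default values are hypothetical placeholders, the RPN loss is left out, and the complementary fusion rule is one plausible choice that this review does not spell out.

```python
def double_head_ext_loss(fc_cls, fc_reg, conv_cls, conv_reg,
                         lam_fc=0.7, lam_conv=0.8, w_fc=1.0, w_conv=1.0):
    """Combine per-head losses so each head is also trained on its unfocused task.

    fc_cls / fc_reg     : classification and box-regression loss of the fc-head
    conv_cls / conv_reg : classification and box-regression loss of the conv-head
    lam_fc, lam_conv    : hypothetical weights of each head's *focused* task
    w_fc, w_conv        : hypothetical weights balancing the two heads
    """
    # fc-head focuses on classification; regression is its unfocused task.
    fc_loss = lam_fc * fc_cls + (1.0 - lam_fc) * fc_reg
    # conv-head focuses on regression; classification is its unfocused task.
    conv_loss = (1.0 - lam_conv) * conv_cls + lam_conv * conv_reg
    return w_fc * fc_loss + w_conv * conv_loss


def fuse_scores(s_fc, s_conv):
    """Assumed complementary fusion of two class probabilities in [0, 1].

    The fc-head score is kept as the base, and the conv-head score
    contributes only to the probability mass the fc-head left over.
    """
    return s_fc + s_conv * (1.0 - s_fc)


if __name__ == "__main__":
    # Setting both lambdas to 1 removes the unfocused tasks, which recovers
    # the plain Double-Head objective (fc-head cls + conv-head reg).
    print(double_head_ext_loss(1.0, 2.0, 3.0, 4.0, lam_fc=1.0, lam_conv=1.0))
    print(fuse_scores(0.5, 0.5))
```

With this fusion rule the fused score never falls below the fc-head score, which matches the idea that the fc-head remains the primary classifier and the conv-head classifier only supplements it.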

Experimental results

Research questions

RQ1: Do fc-heads and conv-heads have complementary strengths for classification and localization?
RQ2: How does spatial sensitivity differ between fc-head and conv-head, and how does it affect IoU correlation?
RQ3: Can separating tasks into two heads improve detection performance over single-head baselines?
RQ4: Does incorporating unfocused tasks and classifier fusion further enhance accuracy?

Key findings

fc-head yields higher classification scores that correlate more with IoU than conv-head, especially for small objects.
conv-head provides more accurate bounding box regression than fc-head.
Double-Head (fc-head for classification and conv-head for regression) outperforms both single-head baselines on COCO with ResNet-50 and ResNet-101 backbones.
Double-Head-Ext further improves results by supervising unfocused tasks during training and fusing the two classifiers at inference, reaching results competitive with state-of-the-art detectors on COCO test-dev.
On VOC07, Double-Head-Ext surpasses the FPN baseline by noticeable margins across AP, AP@0.5, and AP@0.75.
On COCO val2017, Double-Head-Ext reaches 42.3 AP with a ResNet-101 backbone and exceeds 49% AP at some IoU thresholds, improving on the baselines.
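
The first finding above, that fc-head classification scores track proposal IoU more closely, rests on a correlation measurement that can be illustrated in a few lines. The sketch below assumes boxes in `(x1, y1, x2, y2)` format and uses the Pearson correlation coefficient; the helper names and the sample proposals and scores are made up for illustration and are not data from the paper.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0  # correlation is undefined for constant input
    return cov / (std_x * std_y)


if __name__ == "__main__":
    gt_box = (10, 10, 50, 50)
    # Hypothetical proposals around the ground-truth box.
    proposals = [(10, 10, 50, 50), (15, 15, 55, 55), (30, 30, 70, 70), (45, 45, 90, 90)]
    # Hypothetical classification scores from two heads for those proposals.
    scores_head_a = [0.95, 0.80, 0.40, 0.10]
    scores_head_b = [0.90, 0.88, 0.75, 0.30]
    ious = [iou(p, gt_box) for p in proposals]
    print("head A:", pearson(scores_head_a, ious))
    print("head B:", pearson(scores_head_b, ious))
```

A head whose scores fall off as proposals drift away from the object produces a higher coefficient, which is the property the review attributes to fc-head.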


This review was created by AI and reviewed by human editors.