QUICK REVIEW

[Paper Review] Rich feature hierarchies for accurate object detection and semantic segmentation

Ross Girshick, Jeff Donahue | arXiv (Cornell University) | Nov 11, 2013

Advanced Neural Network Applications | References: 22 | Cited by: 522

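One-Sentence Summary

This paper proposes R-CNN (Regions with CNN features), an object detection framework that combines selective search region proposals, a deep convolutional neural network (CNN) for feature extraction, and linear SVMs for classification. By transferring ImageNet pre-training and then fine-tuning on PASCAL VOC detection data, R-CNN achieves a mean average precision (mAP) of 53.3% on PASCAL VOC 2012, a relative improvement of more than 30% over the previous best result.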

ABSTRACT

Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012---achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.

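Research Motivation and Goals

Improve object detection performance by breaking through the accuracy plateau that traditional HOG-based features and ensemble methods had reached on PASCAL VOC.
Investigate whether a deep CNN pre-trained on large-scale image classification can be adapted effectively to object detection when labeled detection data is limited.
Evaluate how well the combination of region proposals and deep features works for both detection and semantic segmentation.
Compare R-CNN with sliding-window detectors such as OverFeat on large-scale benchmarks such as ILSVRC2013.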

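Proposed Method

Selective search generates roughly 2,000 category-independent region proposals per image.
Each region proposal is warped to a fixed size (227×227) and passed through a pre-trained deep CNN (AlexNet) to extract deep convolutional features.
Class-specific linear SVMs, one per class, are trained on the CNN features and score each region proposal for each of the 20 PASCAL VOC classes.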
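The CNN is trained in two stages: supervised pre-training on ImageNet, followed by fine-tuning on the VOC detection data with a reduced learning rate (the paper uses 0.001, one tenth of the initial pre-training rate) so that fine-tuning does not overwrite the pre-trained initialization. The SVMs and the bounding-box regressors are trained as separate steps afterwards, so the pipeline as a whole is not trained end to end.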
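Bounding-box regression is applied to refine the location of each scored region and reduce localization errors.
The framework is extended to semantic segmentation by computing CNN features on region proposals and classifying those regions.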
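The bounding-box regression step can be illustrated with a small sketch. It is not the authors' code: it assumes boxes in `(x1, y1, x2, y2)` corner format and uses the center/size parameterization described in the paper's bounding-box regression appendix, where a regressor predicts offsets `(dx, dy, dw, dh)` relative to the proposal. The function names (`to_center`, `encode_box`, `decode_box`, `iou`) and the example boxes are hypothetical, chosen only for illustration.

```python
import math


def to_center(box):
    """Convert (x1, y1, x2, y2) corners to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + 0.5 * w, y1 + 0.5 * h, w, h


def to_corners(cx, cy, w, h):
    """Convert (center_x, center_y, width, height) back to corners."""
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def encode_box(proposal, ground_truth):
    """Regression targets (tx, ty, tw, th) that map a proposal onto a ground-truth box."""
    px, py, pw, ph = to_center(proposal)
    gx, gy, gw, gh = to_center(ground_truth)
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def decode_box(proposal, deltas):
    """Apply predicted offsets (dx, dy, dw, dh) to a proposal to get a refined box."""
    px, py, pw, ph = to_center(proposal)
    dx, dy, dw, dh = deltas
    return to_corners(px + pw * dx, py + ph * dy, pw * math.exp(dw), ph * math.exp(dh))


def iou(a, b):
    """Intersection over union of two corner-format boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical proposal and ground-truth boxes (illustrative values only).
proposal = (50.0, 60.0, 150.0, 200.0)
ground_truth = (60.0, 50.0, 170.0, 210.0)

# With perfect predictions (deltas equal to the targets), decoding recovers the ground truth.
deltas = encode_box(proposal, ground_truth)
refined = decode_box(proposal, deltas)

print("IoU before refinement:", round(iou(proposal, ground_truth), 3))
print("IoU after refinement: ", round(iou(refined, ground_truth), 3))
```

In practice the offsets come from a learned regressor on CNN features rather than from the ground truth, so the refined box only approximates the target. The log-space encoding of width and height keeps the predicted sizes positive after decoding.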

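Experimental Results

Research Questions

RQ1: Compared with traditional hand-engineered features such as HOG, do deep convolutional neural networks combined with region proposals significantly improve object detection accuracy?
RQ2: Does pre-training on a large-scale image classification task (such as ImageNet), followed by fine-tuning on a smaller detection dataset, yield a significant gain in detection performance?
RQ3: How does R-CNN's mean average precision compare with that of sliding-window detectors such as OverFeat on a large-scale detection benchmark?
RQ4: To what extent can the region-based CNN framework be adapted to semantic segmentation?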

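Key Findings

R-CNN achieves a mean average precision (mAP) of 53.3% on the PASCAL VOC 2012 detection dataset, a relative improvement of more than 30% over the previous best result.
On the ILSVRC2013 detection dataset, R-CNN reaches 31.4% mAP, well ahead of OverFeat's 24.3%.
Transfer learning (pre-training on ImageNet, then fine-tuning on VOC) yields a significant performance boost, particularly when labeled detection data is scarce.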
Bounding-box regression reduces localization errors and thereby improves detection accuracy.
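The framework also generalizes well to semantic segmentation, indicating that region-based CNN features are effective for both detection and segmentation.
On PASCAL VOC 2010 the method achieves state-of-the-art performance with 53.7% mAP, far ahead of a system that uses the same region proposals with a spatial pyramid and bag-of-visual-words features (35.1% mAP).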
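The relative-improvement figures quoted above are simple ratios, and a short calculation makes them concrete. The sketch below uses only numbers stated in this review (31.4% vs. 24.3% on ILSVRC2013, and 53.3% with a gain of more than 30% on VOC 2012). The helper name `relative_improvement` is an illustrative choice, and the derived baseline is an upper bound implied by the 30% figure, not a number reported here.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


# ILSVRC2013 detection: R-CNN vs. OverFeat mAP, both in percent.
ilsvrc_gain = relative_improvement(31.4, 24.3)
print(f"ILSVRC2013 relative gain over OverFeat: {ilsvrc_gain:.1%}")

# VOC 2012: a gain of more than 30% at 53.3% mAP implies the previous
# best result was below 53.3 / 1.30.
implied_baseline_bound = 53.3 / 1.30
print(f"Implied upper bound on previous VOC 2012 best: {implied_baseline_bound:.1f}% mAP")
```

The absolute gap on ILSVRC2013 is 7.1 mAP points, which corresponds to a relative gain of about 29%; keeping absolute points and relative percentages separate avoids misreading either figure.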


This review was generated by AI and reviewed by human editors.