QUICK REVIEW

[Paper Review] Places: An Image Database for Deep Scene Understanding

Bolei Zhou, Aditya Khosla|arXiv (Cornell University)|Oct 6, 2016

Advanced Image and Video Retrieval Techniques|References: 31|Cited by: 175

ONE-SENTENCE SUMMARY

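This paper introduces Places, a scene-centric database of 10 million images covering 434 place categories, built through multi-stage crowdsourcing and classifier-based bootstrapping, and shows that CNNs trained on it achieve strong scene classification performance. It also compares scene-centric features with object-centric features and provides benchmarks and visualization-based insights.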

ABSTRACT

The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification at tasks such as object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories and attributes, comprising a quasi-exhaustive list of the types of environments encountered in the world. Using state of the art Convolutional Neural Networks, we provide impressive baseline performances at scene classification. With its high-coverage and high-diversity of exemplars, the Places Database offers an ecosystem to guide future progress on currently intractable visual recognition problems.

RESEARCH MOTIVATION AND OBJECTIVES

Motivate the creation of a large-scale, diverse, and category-rich scene dataset to advance deep scene understanding.
Describe the construction pipeline combining web data collection, crowdsourced labeling, and semi-automatic bootstrapping.
Establish benchmarks (Places365 variants, Places205, Places88) to enable fair evaluation of scene recognition methods.
Explore the effectiveness of scene-centric CNN features (Places-CNN) versus object-centric features (ImageNet-CNN) for scene classification.
Provide qualitative analyses and visualizations to understand learned representations in scene-centric networks.

PROPOSED METHOD

Aggregate 10 million images from the web using SUN-derived scene categories and adjective-based queries to increase diversity.
Crowdsource labeling via Amazon Mechanical Turk, using multiple rounds of validation to select true exemplars for 476 scene categories.
Bootstrap semi-automatically with a CNN (AlexNet) that classifies the remaining unlabeled images and guides targeted manual annotation.
Merge and disambiguate near-synonymous categories and refine labels to improve category separability.
Train and evaluate CNN baselines (AlexNet, GoogLeNet, VGG, and ResNet variants) on Places205 and Places365 subsets; compare against ImageNet-CNN features.
Analyze feature representations and visualize units' receptive fields and synthetic inputs to interpret learned scene concepts.
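
The bootstrapping step in the list above can be illustrated with a minimal sketch. It assumes a trained classifier has already produced per-category scores for each unlabeled image. The function name select_for_annotation, the threshold value, and the toy scores are hypothetical and are not taken from the paper; the actual routing rule used by the authors may differ.

```python
from collections import defaultdict


def select_for_annotation(scores, threshold=0.5):
    """Group unlabeled images by their top-scoring category.

    scores: dict mapping image id -> {category: classifier score}.
    Returns a dict mapping category -> list of image ids whose best
    score is at least `threshold`. In this hypothetical routing rule,
    those images would be sent to human annotators for verification.
    """
    queue = defaultdict(list)
    for image_id, per_category in scores.items():
        # Pick the category with the highest classifier score.
        best_category = max(per_category, key=per_category.get)
        if per_category[best_category] >= threshold:
            queue[best_category].append(image_id)
    return dict(queue)


# Toy classifier outputs (made-up numbers for illustration only).
toy_scores = {
    "img_001": {"bedroom": 0.91, "kitchen": 0.05, "forest": 0.04},
    "img_002": {"bedroom": 0.40, "kitchen": 0.35, "forest": 0.25},
    "img_003": {"bedroom": 0.10, "kitchen": 0.10, "forest": 0.80},
}
candidates = select_for_annotation(toy_scores, threshold=0.5)
print(candidates)
```

Here img_002 is left out because no category reaches the threshold. A stricter or looser threshold trades annotation cost against how many candidate images each category receives.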

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: How large and diverse must a scene-centric dataset be to enable robust deep scene understanding?
RQ2: Can crowdsourcing combined with bootstrapping reliably create a high-coverage Places dataset from web images?
RQ3: How do scene-centric CNN features (Places-CNN) compare to object-centric features (ImageNet-CNN) on scene-centric benchmarks?
RQ4: What benchmarks best represent progress in scene recognition, and how do different CNN architectures perform on them?
RQ5: What do the internal units of Places-CNNs reveal about learned scene representations, and how can visualization aid interpretation?

KEY FINDINGS

Model	Test set	Top-1 acc.	Top-5 acc.
ImageNet-AlexNet feature+SVM	Places205 test	40.80%	70.20%
Places205-AlexNet	Places205 test	50.04%	81.10%
Places205-GoogLeNet	Places205 test	55.50%	85.66%
Places205-VGG	Places205 test	58.90%	87.70%
SamExynos*	Places205 test	64.10%	90.65%
SIAT MMLAB*	Places205 test	62.34%	89.66%
Places205-AlexNet	SUN205 test	67.52%	92.61%
Places205-GoogLeNet	SUN205 test	71.60%	95.01%
Places205-VGG	SUN205 test	74.60%	95.92%
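
The Top-1 and Top-5 columns in the table above follow the usual definition: a prediction counts as correct if the true label appears among the k highest-ranked categories. The sketch below shows that computation on made-up data. The function name top_k_accuracy and the toy predictions are illustrative assumptions and are not part of the paper or its evaluation code.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of samples whose true label is among the first k predictions.

    ranked_predictions: list of category lists, each sorted from most to
    least confident. true_labels: list of ground-truth categories.
    """
    if not true_labels:
        raise ValueError("at least one sample is required")
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Toy ranked outputs for four images (illustration only).
toy_predictions = [
    ["bedroom", "hotel_room", "living_room", "nursery", "dorm_room", "kitchen"],
    ["forest", "park", "field", "garden", "mountain", "valley"],
    ["kitchen", "pantry", "galley", "restaurant", "bar", "cafeteria"],
    ["harbor", "pier", "beach", "coast", "bridge", "lighthouse"],
]
toy_labels = ["bedroom", "garden", "dining_room", "pier"]

top1 = top_k_accuracy(toy_predictions, toy_labels, k=1)
top5 = top_k_accuracy(toy_predictions, toy_labels, k=5)
print(f"Top-1: {top1:.2%}, Top-5: {top5:.2%}")
```

Top-5 is never lower than Top-1 on the same predictions, which is why every row of the table reports a higher Top-5 figure.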

Places: 10,624,928 images across 434 place categories, built via a multi-step process with crowdsourced validation and bootstrapping.
Places365-Standard contains 1,803,460 training images; Places365-Challenge expands the training set to about 8 million images; Places205 has 2.5 million images across 205 categories.
Places-CNN features outperform ImageNet-CNN features on scene-centric tasks, with Places365-VGG achieving 63.24% Top-1 on SUN397 and Hybrid1365-VGG achieving the best average accuracy across eight datasets.
On Places205 and SUN205, Places-CNNs (e.g., Places205-VGG, Places205-GoogLeNet) significantly surpass the ImageNet-CNN baselines in Top-1/Top-5 accuracy.
The unified Places benchmarks (Places365-Standard/Challenge, Places205, Places88) enable consistent evaluation and progress tracking for scene recognition research.
Visualization shows Places-CNN units detect scene parts (bed, chair, buildings) rather than object parts, highlighting a distinct learned representation from object-centric networks.
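
To make the size of the gap on Places205 concrete, the sketch below computes the percentage-point gain of each Places205 network over the ImageNet-AlexNet feature+SVM baseline, using only the figures from the results table in this review. The helper name gain_over_baseline is hypothetical. The comparison is indicative only, since the baseline is a feature-plus-SVM pipeline rather than a network trained end to end.

```python
# (Top-1 %, Top-5 %) on the Places205 test set, copied from the results table.
results = {
    "ImageNet-AlexNet feature+SVM": (40.80, 70.20),
    "Places205-AlexNet": (50.04, 81.10),
    "Places205-GoogLeNet": (55.50, 85.66),
    "Places205-VGG": (58.90, 87.70),
}


def gain_over_baseline(results, baseline):
    """Return percentage-point gains (Top-1, Top-5) of every model over `baseline`."""
    base_top1, base_top5 = results[baseline]
    return {
        name: (round(top1 - base_top1, 2), round(top5 - base_top5, 2))
        for name, (top1, top5) in results.items()
        if name != baseline
    }


gains = gain_over_baseline(results, "ImageNet-AlexNet feature+SVM")
for name, (d1, d5) in gains.items():
    print(f"{name}: +{d1:.2f} Top-1, +{d5:.2f} Top-5 (percentage points)")
```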

This review was generated by AI and reviewed by a human editor.