QUICK REVIEW

[Paper Review] Places: An Image Database for Deep Scene Understanding

Bolei Zhou, Aditya Khosla | arXiv (Cornell University) | Oct 6, 2016
Advanced Image and Video Retrieval Techniques | References: 31 | Cited by: 175
One-Sentence Summary

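This paper introduces Places, a scene-centric database of 10 million images covering 476 categories, built through multi-stage crowdsourcing and bootstrapping, and shows that CNNs trained on it achieve strong scene classification performance. It also compares scene-centric features with object-centric features and provides benchmarks and visualization-based insights.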

ABSTRACT

The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification at tasks such as object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories and attributes, comprising a quasi-exhaustive list of the types of environments encountered in the world. Using state of the art Convolutional Neural Networks, we provide impressive baseline performances at scene classification. With its high-coverage and high-diversity of exemplars, the Places Database offers an ecosystem to guide future progress on currently intractable visual recognition problems.

Research Motivation and Goals

  • Motivate the creation of a large-scale, diverse, and category-rich scene dataset to advance deep scene understanding.
  • Describe the construction pipeline combining web data collection, crowdsourced labeling, and semi-automatic bootstrapping.
  • Establish benchmarks (Places365 variants, Places205, Places88) to enable fair evaluation of scene recognition methods.
  • Explore the effectiveness of scene-centric CNN features (Places-CNN) versus object-centric features (ImageNet-CNN) for scene classification.
  • Provide qualitative analyses and visualizations to understand learned representations in scene-centric networks.

Proposed Method

  • Aggregate 10 million images from the web using SUN-derived scene categories and adjective-based queries to increase diversity.
  • Crowdsource labeling via Amazon Mechanical Turk, using multiple rounds of validation to select true exemplars for 476 scene categories.
  • Semi-automatic bootstrapping with a CNN (AlexNet) to classify remaining unlabeled images and guide targeted manual annotation.
  • Merge and disambiguate near-synonymous categories and refine labels to improve category separability.
  • Train and evaluate CNN baselines (AlexNet, GoogLeNet, VGG, and ResNet variants) on Places205 and Places365 subsets; compare against ImageNet-CNN features.
  • Analyze feature representations and visualize units' receptive fields and synthetic inputs to interpret learned scene concepts.
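The semi-automatic bootstrapping step listed above can be illustrated with a minimal sketch. It assumes a classifier has already produced per-category probabilities for each unlabeled image, and it queues only confident predictions for targeted manual verification. The function name `select_for_verification`, the image ids, the category names, and the 0.8 cutoff are all illustrative assumptions, not details confirmed by this review.

```python
from typing import Dict, List


def select_for_verification(
    scores: Dict[str, Dict[str, float]],
    threshold: float = 0.8,  # assumed cutoff, chosen only for illustration
) -> Dict[str, List[str]]:
    """Group unlabeled images by predicted category when the classifier is confident.

    `scores` maps an image id to {category: probability}. Images whose best
    score falls below `threshold` are left out of the verification queue.
    """
    queue: Dict[str, List[str]] = {}
    for image_id, probs in scores.items():
        # Pick the category with the highest predicted probability.
        category, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            queue.setdefault(category, []).append(image_id)
    return queue


# Hypothetical classifier outputs for three unlabeled web images.
unlabeled_scores = {
    "img_001": {"bedroom": 0.93, "hotel_room": 0.07},
    "img_002": {"bedroom": 0.55, "hotel_room": 0.45},
    "img_003": {"bedroom": 0.10, "hotel_room": 0.90},
}
verification_queue = select_for_verification(unlabeled_scores)
print(verification_queue)
# {'bedroom': ['img_001'], 'hotel_room': ['img_003']}
```

In this toy run, the ambiguous image `img_002` is not queued. In a real pipeline, the queued images would still go to human annotators, since a confident prediction is only a candidate label.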

Experimental Results

Research Questions

  • RQ1: How large and diverse must a scene-centric dataset be to enable robust deep scene understanding?
  • RQ2: Can crowdsourcing combined with bootstrapping reliably create a high-coverage Places dataset from web images?
  • RQ3: How do scene-centric CNN features (Places-CNN) compare to object-centric features (ImageNet-CNN) on scene-centric benchmarks?
  • RQ4: What benchmarks best represent progress in scene recognition, and how do different CNN architectures perform on them?
  • RQ5: What do the internal units of Places-CNNs reveal about learned scene representations, and how can visualization aid interpretation?

Key Findings

| Model | Test set | Top-1 acc. | Top-5 acc. |
|---|---|---|---|
| ImageNet-AlexNet feature+SVM | Places205 test | 40.80% | 70.20% |
| Places205-AlexNet | Places205 test | 50.04% | 81.10% |
| Places205-GoogLeNet | Places205 test | 55.50% | 85.66% |
| Places205-VGG | Places205 test | 58.90% | 87.70% |
| SamExynos* | Places205 test | 64.10% | 90.65% |
| SIAT MMLAB* | Places205 test | 62.34% | 89.66% |
| Places205-AlexNet | SUN205 test | 67.52% | 92.61% |
| Places205-GoogLeNet | SUN205 test | 71.60% | 95.01% |
| Places205-VGG | SUN205 test | 74.60% | 95.92% |

  • Places: 10,624,928 images across 434 place categories, built via a multi-step process with crowdsourced validation and bootstrapping.
  • Places365-Standard contains 1,803,460 training images; Places365-Challenge adds ~8 million training images; Places205 has 2.5 million images across 205 categories.
  • Places-CNN features outperform ImageNet-CNN features on scene-centric tasks: Places365-VGG reaches 63.24% Top-1 accuracy on SUN397, and the hybrid 1365-VGG achieves the best average across eight datasets.
  • On Places205 and SUN205, Places-CNNs (e.g., Places205-VGG, Places205-GoogLeNet) significantly surpass the ImageNet-CNN baselines in Top-1/Top-5 accuracy.
  • The unified Places benchmarks (Places365-Standard/Challenge, Places205, Places88) enable consistent evaluation and progress tracking for scene recognition research.
  • Visualization shows Places-CNN units detect scene parts (bed, chair, buildings) rather than object parts, highlighting a distinct learned representation from object-centric networks.
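For readers unfamiliar with the Top-1 and Top-5 accuracies reported above, the following is a minimal sketch of how a top-k accuracy is typically computed from per-class scores. The function name `top_k_accuracy` and the toy scores and labels are illustrative assumptions. It is not the evaluation code used for the reported numbers.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int,
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Rank class indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example with three classes and four samples (made-up numbers).
scores = [
    [0.6, 0.3, 0.1],  # true class 0: ranked first
    [0.2, 0.5, 0.3],  # true class 2: ranked second
    [0.1, 0.2, 0.7],  # true class 0: ranked last
    [0.3, 0.4, 0.3],  # true class 1: ranked first
]
labels = [0, 2, 0, 1]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)  # 0.5 0.75
```

Top-k accuracy never decreases as k grows, which is why every Top-5 figure in the table is higher than the corresponding Top-1 figure.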


This review was generated by AI and reviewed by human editors.