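[Paper Review] Places: An Image Database for Deep Scene Understanding
This paper introduces Places, a scene-centric database of 10 million images covering 476 categories, built through multi-stage crowdsourcing and bootstrapping, and shows that CNNs trained on it achieve strong scene classification performance. It also compares scene-centric features with object-centric features and provides benchmarks and visualization insights.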
The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification at tasks such as object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories and attributes, comprising a quasi-exhaustive list of the types of environments encountered in the world. Using state of the art Convolutional Neural Networks, we provide impressive baseline performances at scene classification. With its high-coverage and high-diversity of exemplars, the Places Database offers an ecosystem to guide future progress on currently intractable visual recognition problems.
Research Motivation and Objectives
- Motivate the creation of a large-scale, diverse, and category-rich scene dataset to advance deep scene understanding.
- Describe the construction pipeline combining web data collection, crowdsourced labeling, and semi-automatic bootstrapping.
- Establish benchmarks (Places365 variants, Places205, Places88) to enable fair evaluation of scene recognition methods.
- Explore the effectiveness of scene-centric CNN features (Places-CNN) versus object-centric features (ImageNet-CNN) for scene classification.
- Provide qualitative analyses and visualizations to understand learned representations in scene-centric networks.
Proposed Method
- Aggregate 10 million images from the web using SUN-derived scene categories and adjective-based queries to increase diversity.
- Crowdsource labeling via Amazon Mechanical Turk, with multiple rounds of validation, to select true exemplars for 476 scene categories.
- Semi-automatic bootstrapping with a CNN (AlexNet) to classify remaining unlabeled images and guide targeted manual annotation.
- Merge and disambiguate near-synonymous categories and refine labels to improve category separability.
- Train and evaluate CNN baselines (AlexNet, GoogLeNet, VGG, and ResNet variants) on Places205 and Places365 subsets; compare against ImageNet-CNN features.
- Analyze feature representations and visualize units' receptive fields and synthetic inputs to interpret learned scene concepts.
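The bootstrapping step in the list above can be illustrated with a minimal sketch. It assumes a classifier that returns per-category probabilities for each unlabeled image, and it queues an image for manual verification only when the predicted category matches the category of the query that retrieved it and the confidence clears a threshold. The `Candidate` type, the `select_for_annotation` function, the agreement rule, and the `min_confidence` value are all hypothetical choices made for illustration, not details confirmed by the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    image_id: str
    query_category: str  # category whose search query retrieved the image
    scores: dict         # hypothetical classifier output: category -> probability


def select_for_annotation(candidates, min_confidence=0.5):
    """Group image ids by category for the images worth sending to annotators.

    An image is kept when the classifier's top prediction agrees with the
    query category and its probability is at least min_confidence
    (both rules are illustrative assumptions).
    """
    queue = {}
    for c in candidates:
        predicted = max(c.scores, key=c.scores.get)
        if predicted == c.query_category and c.scores[predicted] >= min_confidence:
            queue.setdefault(predicted, []).append(c.image_id)
    return queue


# Toy pool of unlabeled images with made-up scores.
pool = [
    Candidate("img_001", "bedroom", {"bedroom": 0.91, "hotel_room": 0.09}),
    Candidate("img_002", "bedroom", {"bedroom": 0.30, "kitchen": 0.70}),
    Candidate("img_003", "kitchen", {"bedroom": 0.45, "kitchen": 0.55}),
]
print(select_for_annotation(pool))
```

Raising `min_confidence` shrinks the annotation queue at the cost of skipping harder, possibly more diverse, exemplars, so in practice the threshold would be a trade-off between annotation budget and coverage.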
Experimental Results
Research Questions
- RQ1: How large and diverse must a scene-centric dataset be to enable robust deep scene understanding?
- RQ2: Can crowdsourcing combined with bootstrapping reliably create a high-coverage Places dataset from web images?
- RQ3: How do scene-centric CNN features (Places-CNN) compare to object-centric features (ImageNet-CNN) on scene-centric benchmarks?
- RQ4: What benchmarks best represent progress in scene recognition, and how do different CNN architectures perform on them?
- RQ5: What do the internal units of Places-CNNs reveal about learned scene representations, and how can visualization aid interpretation?
Key Findings
| Model | Test set | Top-1 acc. | Top-5 acc. |
|---|---|---|---|
| ImageNet-AlexNet feature+SVM | Places205 test | 40.80% | 70.20% |
| Places205-AlexNet | Places205 test | 50.04% | 81.10% |
| Places205-GoogLeNet | Places205 test | 55.50% | 85.66% |
| Places205-VGG | Places205 test | 58.90% | 87.70% |
| SamExynos* | Places205 test | 64.10% | 90.65% |
| SIAT MMLAB* | Places205 test | 62.34% | 89.66% |
| Places205-AlexNet | SUN205 test | 67.52% | 92.61% |
| Places205-GoogLeNet | SUN205 test | 71.60% | 95.01% |
| Places205-VGG | SUN205 test | 74.60% | 95.92% |
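The Top-1 and Top-5 columns in the table above report how often the true category is among the 1 or 5 highest-scoring predictions. A minimal sketch of that metric follows; the function name `top_k_accuracy` and the toy scores are illustrative assumptions, and the example uses three classes with k=2 only to keep it small.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: list of per-sample score lists, one score per class index.
    labels: list of true class indices, same length as scores.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy predictions over three classes (made-up numbers).
scores = [
    [0.6, 0.3, 0.1],  # true class 0: correct at k=1
    [0.2, 0.5, 0.3],  # true class 2: missed at k=1, found at k=2
    [0.7, 0.2, 0.1],  # true class 2: missed at both
    [0.1, 0.1, 0.8],  # true class 2: correct at k=1
]
labels = [0, 2, 2, 2]

print(top_k_accuracy(scores, labels, k=1))  # 0.5
print(top_k_accuracy(scores, labels, k=2))  # 0.75
```

Top-5 accuracy is always at least as high as Top-1 on the same predictions, which is why every row of the table shows a larger Top-5 value.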
- Places: 10,624,928 images across 434 place categories, built via a multi-step process with crowdsourced validation and bootstrapping.
- Places365-Standard contains 1,803,460 training images; Places365-Challenge has about 8 million training images in total, including those of Places365-Standard; Places205 has 2.5 million images across 205 categories.
- Places-CNN features outperform ImageNet-CNN features on scene-centric tasks: Places365-VGG achieves 63.24% Top-1 accuracy on SUN397, and the hybrid Hybrid1365-VGG achieves the best average accuracy across eight datasets.
- On Places205 and SUN205, Places-CNNs (e.g., Places205-VGG, Places205-GoogLeNet) significantly surpass the ImageNet-CNN baselines in Top-1/Top-5 accuracy. On the Places205 test set, for example, the Top-1 gain over ImageNet-AlexNet feature+SVM ranges from 9.24 points (Places205-AlexNet) to 18.10 points (Places205-VGG).
- The unified Places benchmarks (Places365-Standard/Challenge, Places205, Places88) enable consistent evaluation and progress tracking for scene recognition research.
- Visualization shows that Places-CNN units detect scene parts (bed, chair, buildings) rather than object parts, indicating a learned representation distinct from that of object-centric networks.
This review was generated by AI and reviewed by human editors.