QUICK REVIEW

[Paper Review] WebVision Database: Visual Learning and Understanding from Web Data

Wen Li, Limin Wang|arXiv (Cornell University)|Aug 9, 2017

Domain Adaptation and Few-Shot Learning | References: 32 | Cited by: 313

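One-Sentence Summary

The paper introduces WebVision, a web dataset of 2.4 million images with accompanying meta information, for studying visual recognition and domain adaptation from noisy web data. Models trained on it generalize competitively with ILSVRC 2012-trained models and transfer strongly to Caltech-256 and PASCAL VOC 2007.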

ABSTRACT

In this paper, we present a study on learning visual recognition models from large scale noisy web data. We build a new database called WebVision, which contains more than $2.4$ million web images crawled from the Internet by using queries generated from the 1,000 semantic concepts of the benchmark ILSVRC 2012 dataset. Meta information along with those web images (e.g., title, description, tags, etc.) are also crawled. A validation set and test set containing human annotated images are also provided to facilitate algorithmic development. Based on our new database, we obtain a few interesting observations: 1) the noisy web images are sufficient for training a good deep CNN model for visual recognition; 2) the model learnt from our WebVision database exhibits comparable or even better generalization ability than the one trained from the ILSVRC 2012 dataset when being transferred to new datasets and tasks; 3) a domain adaptation issue (a.k.a., dataset bias) is observed, which means the dataset can be used as the largest benchmark dataset for visual domain adaptation. Our new WebVision database and relevant studies in this work would benefit the advance of learning state-of-the-art visual models with minimum supervision based on web data.

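Motivation and Goals

Assess how noisy web labels affect visual recognition compared with human-annotated data.
Assess how well models trained on WebVision generalize to other datasets and tasks.
Explore how useful the meta information accompanying web images is for recognition tasks.
Study the dataset bias between WebVision and ILSVRC 2012 and its implications for domain adaptation.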

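Proposed Method

Build the WebVision dataset of 2.4 million images from Flickr and Google Image Search, using the 1,000 ILSVRC 2012 synsets as queries.
Collect the meta information of the web images (title, description, tags, etc.).
Create a human-annotated subset of 100K images via AMT (50K for validation, 50K for testing), with near-duplicate removal and a three-vote quality check.
Train baseline AlexNet models on WebVision and on ILSVRC 2012, and compare their cross-dataset performance on both validation sets.
Evaluate transfer learning with features trained on WebVision and on ILSVRC 2012: image classification on Caltech-256 and PASCAL VOC 2007, and object detection with Faster R-CNN.
Subsample WebVision and ILSVRC 2012 images to study how label noise and the trade-off between data quantity and quality affect recognition performance.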
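
The subsampling step in the last item can be sketched as follows. This is a minimal illustration, not the authors' protocol: the helper `subsample_per_class`, the per-class sampling strategy, the `fraction` parameter, and the toy data are all assumptions introduced here.

```python
import random
from collections import defaultdict


def subsample_per_class(samples, fraction, seed=0):
    """Keep roughly `fraction` of the images in each class.

    samples: list of (image_id, class_label) pairs.
    Hypothetical helper; the paper's exact sampling scheme is not
    described in this review.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")

    # Group image ids by their class label.
    by_class = defaultdict(list)
    for image_id, label in samples:
        by_class[label].append(image_id)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for label in sorted(by_class):
        ids = by_class[label]
        keep = max(1, round(len(ids) * fraction))  # never drop a class entirely
        subset.extend((image_id, label) for image_id in rng.sample(ids, keep))
    return subset


# Toy data: 3 classes with 8 images each.
toy = [(f"img_{c}_{i}", c) for c in range(3) for i in range(8)]
half = subsample_per_class(toy, fraction=0.5)
print(len(half))  # 4 images kept per class
```

Training the same architecture on subsets of increasing `fraction` from each dataset would then give one accuracy curve per dataset, which is one way to compare the effect of data volume against label quality.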

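Experimental Results

Research Questions

RQ1: Can noisily labeled web data train visual recognition models that are competitive with those trained on human-annotated data?
RQ2: How well do models trained on WebVision generalize to other datasets and tasks (transfer learning)?
RQ3: How do label noise and data volume trade off in web data?
RQ4: Does the meta information accompanying web images improve recognition performance or enable multimodal learning?
RQ5: Is there a measurable dataset bias between WebVision and ILSVRC 2012, and can WebVision serve as a benchmark for domain adaptation?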

Key Findings

Model (training set)	ILSVRC 2012 val Top-1 (%)	ILSVRC 2012 val Top-5 (%)	WebVision val Top-1 (%)	WebVision val Top-5 (%)
ILSVRC 2012	56.79	79.77	52.58	74.64
WebVision	47.55	70.36	57.03	77.90
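
For readers unfamiliar with the two metrics in the table, the sketch below shows how Top-k accuracy is usually computed: a prediction counts as correct if the true label is among the k highest-scoring classes, so Top-5 accuracy can never be lower than Top-1. The function name `top_k_accuracy` and the toy scores are illustrative assumptions, not code from the paper.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest scores.

    scores: list of per-class score lists, one list per sample.
    labels: list of true class indices, one per sample.
    """
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equally long")

    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example with 3 classes and 4 samples.
scores = [
    [0.1, 0.7, 0.2],  # true label 1: correct at k=1
    [0.5, 0.3, 0.2],  # true label 1: wrong at k=1, correct at k=2
    [0.6, 0.3, 0.1],  # true label 2: wrong at k=1 and k=2
    [0.2, 0.1, 0.7],  # true label 2: correct at k=1
]
labels = [1, 1, 2, 2]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)  # 50.0 75.0
```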

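WebVision supports training robust CNN models even in the presence of substantial label noise; the large scale of the data helps offset the noise.
On Caltech-256 and PASCAL VOC 2007, WebVision-trained models generalize as well as or better than ILSVRC 2012 models, and the same holds for object detection on PASCAL VOC 2007.
A domain bias exists between WebVision and ILSVRC 2012: accuracy drops when a model is evaluated on the other dataset's validation set. WebVision features nevertheless transfer well to other tasks.
The meta information attached to the web images has the potential to support multimodal research, and the observed dataset bias suggests that WebVision can serve domain adaptation research.
At large scale, adding more web images mitigates label noise more effectively than improving label quality alone; the benefit of quantity outweighs the cost of noise.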
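
The domain bias finding can be made concrete with simple arithmetic on the results table. The sketch below takes the smaller value in each pair as Top-1 accuracy (Top-5 accuracy can never be lower than Top-1) and computes how much Top-1 accuracy each model loses when evaluated on the other dataset's validation set. The dictionary layout and the helper `cross_dataset_drop` are assumptions made for illustration.

```python
# Top-1 accuracy (%) from the results table, keyed by (training set, validation set).
top1_acc = {
    ("ILSVRC 2012", "ILSVRC 2012"): 56.79,
    ("ILSVRC 2012", "WebVision"): 52.58,
    ("WebVision", "ILSVRC 2012"): 47.55,
    ("WebVision", "WebVision"): 57.03,
}


def cross_dataset_drop(acc, train_set, other_set):
    """Accuracy lost (in points) when moving from the in-domain
    validation set to the other dataset's validation set."""
    return round(acc[(train_set, train_set)] - acc[(train_set, other_set)], 2)


ilsvrc_drop = cross_dataset_drop(top1_acc, "ILSVRC 2012", "WebVision")
webvision_drop = cross_dataset_drop(top1_acc, "WebVision", "ILSVRC 2012")
print(ilsvrc_drop, webvision_drop)
```

Under this reading, the ILSVRC 2012 model loses about 4.2 points on the WebVision validation set and the WebVision model loses about 9.5 points on the ILSVRC 2012 validation set, while each model is the stronger one on its own validation set.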


This review was generated by AI and reviewed by a human editor.