QUICK REVIEW

[Paper Review] Revisiting Unreasonable Effectiveness of Data in Deep Learning Era

Chen Sun, Abhinav Shrivastava|arXiv (Cornell University)|Jul 10, 2017

Advanced Neural Network Applications | References: 40 | Cited by: 303

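One-Sentence Summary

This paper scales pre-training data to 300 million images (JFT-300M) to study how data volume affects visual representations. Performance grows logarithmically with data volume, higher-capacity models benefit more, and the resulting models set new state-of-the-art results on multiple tasks.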

ABSTRACT

The success of deep learning in vision can be attributed to: (a) models with high capacity; (b) increased computational power; and (c) availability of large-scale labeled data. Since 2012, there have been significant advances in representation capabilities of the models and computational capabilities of GPUs. But the size of the biggest dataset has surprisingly remained constant. What will happen if we increase the dataset size by 10x or 100x? This paper takes a step towards clearing the clouds of mystery surrounding the relationship between `enormous data' and visual deep learning. By exploiting the JFT-300M dataset which has more than 375M noisy labels for 300M images, we investigate how the performance of current vision tasks would change if this data was used for representation learning. Our paper delivers some surprising (and some expected) findings. First, we find that the performance on vision tasks increases logarithmically based on volume of training data size. Second, we show that representation learning (or pre-training) still holds a lot of promise. One can improve performance on many vision tasks by just training a better base model. Finally, as expected, we present new state-of-the-art results for different vision tasks including image classification, object detection, semantic segmentation and human pose estimation. Our sincere hope is that this inspires vision community to not undervalue the data and develop collective efforts in building larger datasets.

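Motivation and Objectives

Assess how increasing the amount of pre-training data affects visual representation learning across tasks (classification, detection, segmentation, pose estimation).
Characterize the relationship between data volume and performance, including when higher-capacity models are used.
Demonstrate new state-of-the-art results obtained by pre-training on large, noisy, web-collected data.
Analyze how model capacity, number of classes, and data quality affect transfer-learning performance.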

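Proposed Method

Train a 101-layer ResNet (ResNet-101) on JFT-300M, a dataset with 18,291 label categories and roughly 20% label noise.
Pre-train on JFT-300M, then fine-tune or evaluate the representations on the ImageNet, COCO, PASCAL VOC, and COCO pose estimation benchmarks.
Because an image can carry multiple labels, use a per-label logistic loss, and use the label hierarchy to fill in missing labels.
Evaluate the representations in two ways: as a frozen feature extractor, and by fine-tuning from the JFT-300M initialization.
Compare against ImageNet baselines, and run ablations over data volume, number of classes, and model capacity.
Train asynchronously across 50 GPUs using Downpour SGD with parameter servers.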
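
The per-label logistic loss and hierarchy-based label filling described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the `parent` dictionary, and the four-label toy hierarchy are all hypothetical, and it assumes that "filling in missing labels" means marking every ancestor of a positive label as positive.

```python
import numpy as np

def fill_ancestors(labels, parent):
    """Mark every ancestor of a positive label as positive.

    labels: (num_labels,) array of 0/1 annotations.
    parent: dict mapping child index -> parent index (hypothetical
            hierarchy; root labels have no entry).
    """
    filled = labels.copy()
    for idx in np.flatnonzero(labels):
        node = int(idx)
        while node in parent:          # walk up to the root
            node = parent[node]
            filled[node] = 1
    return filled

def per_label_logistic_loss(logits, targets):
    """Mean sigmoid cross-entropy, one independent binary problem per label."""
    # Numerically stable form: max(x, 0) - x*z + log(1 + exp(-|x|))
    losses = (np.maximum(logits, 0.0)
              - logits * targets
              + np.log1p(np.exp(-np.abs(logits))))
    return losses.mean()

# Toy hierarchy (assumption): 0 = animal, 1 = dog, 2 = terrier, 3 = vehicle
parent = {1: 0, 2: 1}
raw = np.array([0, 0, 1, 0])           # only "terrier" was annotated
targets = fill_ancestors(raw, parent)  # "dog" and "animal" become positive
logits = np.array([2.0, 1.0, 0.5, -3.0])
loss = per_label_logistic_loss(logits, targets.astype(float))
```

Treating each label as its own binary problem, rather than applying a softmax over all labels, lets several labels be positive for the same image, which is what a multi-label dataset requires.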

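Experimental Results

Research Questions

RQ1: Does increasing the amount of pre-training data improve performance on vision tasks when high-capacity models are used?
RQ2: How does representation quality scale with data volume (logarithmic vs. linear growth) and with model capacity?
RQ3: How do the number of classes and label noise affect transfer-learning performance?
RQ4: Do larger base models benefit more from massive data?
RQ5: What roles do data quality (noise) and data quantity play in improving downstream tasks?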

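Key Findings

Performance on vision tasks improves as pre-training data grows, and the improvement is logarithmic in data volume.
Better representations learned from large-scale data substantially improve downstream tasks such as detection, segmentation, and pose estimation.
Model capacity is critical: higher-capacity models (e.g., ResNet-152) gain more from the 300M images than smaller models do.
Training on long-tailed data does not prevent convergence and still improves accuracy.
Pre-training on JFT-300M yields new state-of-the-art results on COCO detection, PASCAL VOC, semantic segmentation, and human pose estimation.
Fine-tuning from a JFT-300M initialization outperforms ImageNet initialization on several benchmarks.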
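
To make the logarithmic-growth finding concrete, the sketch below fits a trend of the form score = a + b * log10(N) to a handful of points. The numbers are synthetic, generated from a known trend purely for illustration; they are not results from the paper, and `fit_log_trend` is a hypothetical helper name.

```python
import numpy as np

def fit_log_trend(num_images, scores):
    """Least-squares fit of score = a + b * log10(num_images); returns (a, b)."""
    x = np.log10(np.asarray(num_images, dtype=float))
    y = np.asarray(scores, dtype=float)
    b, a = np.polyfit(x, y, deg=1)     # polyfit returns [slope, intercept]
    return a, b

# Synthetic points from a known trend (NOT measurements from the paper).
sizes = np.array([1e7, 3e7, 1e8, 3e8])
true_a, true_b = 10.0, 3.0
scores = true_a + true_b * np.log10(sizes)

a, b = fit_log_trend(sizes, scores)
gain_per_10x = b   # under this model, each 10x increase in data adds b points
```

Under a logarithmic trend, each tenfold increase in data buys the same absolute gain, so going from 30M to 300M images helps as much as going from 3M to 30M. A linear trend would instead make the second step ten times more valuable than the first.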


This review was generated by AI and reviewed by a human editor.