QUICK REVIEW

[Paper Review] The Role of Pre-training Data in Transfer Learning

Rahim Entezari, Mitchell Wortsman|arXiv (Cornell University)|Feb 27, 2023

Domain Adaptation and Few-Shot Learning | Cited by 8

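ONE-SENTENCE SUMMARY

The paper systematically studies how the distribution, size, and method of pre-training affect transfer learning performance, particularly under few-shot and full fine-tuning, and shows that data quality and scale can in some cases compensate for a less favorable pre-training distribution.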

ABSTRACT

The transfer learning paradigm of model pre-training and subsequent fine-tuning produces high-accuracy models. While most studies recommend scaling the pre-training size to benefit most from transfer learning, a question remains: what data and method should be used for pre-training? We investigate the impact of pre-training data distribution on the few-shot and full fine-tuning performance using 3 pre-training methods (supervised, contrastive language-image and image-image), 7 pre-training datasets, and 9 downstream datasets. Through extensive controlled experiments, we find that the choice of the pre-training data source is essential for the few-shot transfer, but its role decreases as more data is made available for fine-tuning. Additionally, we explore the role of data curation and examine the trade-offs between label noise and the size of the pre-training dataset. We find that using 2000X more pre-training data from LAION can match the performance of supervised ImageNet pre-training. Furthermore, we investigate the effect of pre-training methods, comparing language-image contrastive vs. image-image contrastive, and find that the latter leads to better downstream accuracy.

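MOTIVATION AND GOALS

Study how different pre-training data distributions affect downstream transfer performance under few-shot and full fine-tuning.
Assess how data curation and label noise affect transfer learning performance.
Compare pre-training methods (supervised, CLIP, SimCLR) and their effect on transferability.
Assess how pre-training dataset size and data quality interact with downstream performance across a diverse set of tasks.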

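PROPOSED METHOD

Pre-train with CLIP, using ResNet-50 as the image encoder, on seven pre-training datasets, and evaluate transfer on nine downstream tasks.
Fine-tune the pre-trained models end to end on each downstream dataset, with a grid search over hyperparameters.
Compare supervised, CLIP, and SimCLR pre-training losses, and analyze how few-shot and full fine-tuning results differ.
Systematically vary the pre-training data source, the dataset size, and the caption/text quality to analyze the effect on transfer.
Assess the effect of data curation by comparing labeled ImageNet data against Flickr captions and against the LAION distribution.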
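The fine-tuning protocol above (few-shot subsets plus a hyperparameter grid search) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `sample_k_shot` and `grid_search`, the hyperparameter names `learning_rate` and `weight_decay`, and the toy objective are all assumptions made for the example. The summary does not say which hyperparameters were searched or how subsets were drawn.

```python
import itertools
import random
from collections import defaultdict


def sample_k_shot(labels, k, seed=0):
    """Return sorted dataset indices holding k randomly chosen examples per class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        if len(indices) < k:
            raise ValueError(f"class {label!r} has fewer than {k} examples")
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


def grid_search(train_and_eval, grid):
    """Try every hyperparameter combination and keep the best-scoring one.

    train_and_eval is a caller-supplied function (hypothetical here) that
    fine-tunes a model with the given settings and returns validation accuracy.
    """
    names = sorted(grid)
    best_config, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = train_and_eval(**config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Toy usage: nine examples in three classes, two shots per class.
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
subset = sample_k_shot(labels, k=2)


def toy_objective(learning_rate, weight_decay):
    # Stand-in for real fine-tuning: peaks at learning_rate=0.01, weight_decay=0.
    return -abs(learning_rate - 0.01) - weight_decay


best_config, best_score = grid_search(
    toy_objective,
    {"learning_rate": [0.1, 0.01, 0.001], "weight_decay": [0.0, 0.0001]},
)
```

In a real run, `train_and_eval` would fine-tune the pre-trained ResNet-50 on the indices returned by `sample_k_shot` and report held-out accuracy. Full fine-tuning corresponds to skipping the subset step and using every training example.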

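EXPERIMENTAL RESULTS

Research Questions

RQ1: Do different pre-training data distributions lead to different transfer performance in the few-shot setting?
RQ2: How do pre-training data quality and curation affect downstream transfer, compared with noisier but larger datasets?
RQ3: What is the relative effect of pre-training data size on transfer performance across downstream tasks?
RQ4: How do different pre-training methods (supervised, CLIP, SimCLR) compare in transferability?
RQ5: Can a very large, noisy dataset such as LAION match or exceed curated, labeled pre-training (ImageNet) on some tasks?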

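Key Findings

In few-shot transfer, differences between pre-training data sources are pronounced, but they largely vanish as more fine-tuning data is used.
On most downstream tasks, even the worst pre-training dataset beats training from scratch.
Curation that improves caption/text quality (templated captions) significantly raises transfer accuracy compared with raw Flickr captions.
Scaling up the pre-training data helps, and LAION-2B outperforms ImageNet on some tasks, but the gains vary by task and saturate on some of them.
SimCLR pre-training usually outperforms CLIP in few-shot transfer, but the gap narrows as more downstream data becomes available.
At very large data scale, LAION pre-training can match or exceed ImageNet pre-training in some cases, but not for every task.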
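The CLIP versus SimCLR comparison in the findings comes down to what counts as a positive pair: an image and its caption (language-image) or two augmented views of the same image (image-image). The sketch below shows a symmetric contrastive loss over paired embeddings in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names are made up for this example, the temperature of 0.07 is an assumed default, and SimCLR's actual NT-Xent loss also uses negatives drawn from within each view, which this simplified version omits.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(left, right, temperature=0.07):
    """Contrastive loss over a batch of paired embeddings.

    left[i] and right[i] form the positive pair; every other pairing in the
    batch acts as a negative. The loss averages both matching directions.
    """
    left = [normalize(v) for v in left]
    right = [normalize(v) for v in right]
    logits = [
        [sum(a * b for a, b in zip(u, v)) / temperature for v in right]
        for u in left
    ]
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy usage: two orthogonal embeddings, paired correctly and then swapped.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(embeddings, embeddings)
swapped_loss = symmetric_contrastive_loss(embeddings, [embeddings[1], embeddings[0]])
```

For a language-image setup, `left` would hold image embeddings and `right` the matching caption embeddings. For an image-image setup, both would come from the image encoder applied to two augmentations of the same batch. Correctly paired inputs should give a lower loss than mismatched ones, which is what the toy usage compares.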

This review was generated by AI and reviewed by human editors.