QUICK REVIEW

[Paper Review] Reproducible scaling laws for contrastive language-image learning

Mehdi Cherti, Romain Beaumont|arXiv (Cornell University)|Dec 14, 2022

Multimodal Machine Learning Applications | References: 75 | Cited by: 29

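One-Sentence Summary

This paper demonstrates power-law scaling with model size, data size, and number of samples seen for CLIP-like models trained on public data (LAION-2B), and compares OpenCLIP with OpenAI CLIP on zero-shot classification and retrieval.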

ABSTRACT

Scaling up neural networks has led to remarkable performance across a wide range of tasks. Moreover, performance often follows reliable scaling laws as a function of training set size, model size, and compute, which offers valuable guidance as large-scale experiments are becoming increasingly expensive. However, previous work on scaling laws has primarily used private data & models or focused on uni-modal language or vision learning. To address these limitations, we investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository. Our large-scale experiments involve models trained on up to two billion image-text pairs and identify power law scaling for multiple downstream tasks including zero-shot classification, retrieval, linear probing, and end-to-end fine-tuning. We find that the training distribution plays a key role in scaling laws as the OpenAI and OpenCLIP models exhibit different scaling behavior despite identical model architectures and similar training recipes. We open-source our evaluation workflow and all models, including the largest public CLIP models, to ensure reproducibility and make scaling laws research more accessible. Source code and instructions to reproduce this study will be available at https://github.com/LAION-AI/scaling-laws-openclip

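Motivation and Goals

Study how scaling model size, data size, and the number of samples seen affects downstream CLIP performance.
Assess whether scaling laws for multimodal learning hold when using public data and open-source code.
Compare the scaling behavior of OpenCLIP (trained on LAION) with OpenAI CLIP (trained on WIT) across tasks.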

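Proposed Method

Train CLIP models with OpenCLIP at multiple scales: ViT-B/32, B/16, L/14, H/14, and g/14.
Use three data scales (LAION-80M, LAION-400M, and LAION-2B), with training runs at 3B, 13B, and 34B samples seen.
Evaluate on diverse downstream tasks using zero-shot classification, image/text retrieval, linear probing, and fine-tuning.
Fit power laws on the Pareto frontier of the models, relating performance to total training compute, data size, and samples seen.
Release the evaluation workflow and models as open source for reproducibility.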
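
The Pareto-frontier fitting step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a power law of the form error = beta * compute^alpha and fits it by least squares in log-log space. The helper names `pareto_frontier` and `fit_power_law` are hypothetical, and the data points are synthetic values made up for the example, not results from the paper.

```python
import numpy as np

def pareto_frontier(compute, error):
    """Indices of runs that beat every run with less compute (lower error)."""
    order = np.argsort(compute, kind="stable")
    best = np.inf
    keep = []
    for i in order:
        if error[i] < best:  # strictly better than all cheaper runs
            keep.append(i)
            best = error[i]
    return np.array(keep)

def fit_power_law(compute, error):
    """Fit error = beta * compute**alpha via a linear fit in log-log space."""
    alpha, log_beta = np.polyfit(np.log(compute), np.log(error), 1)
    return alpha, np.exp(log_beta)

# Synthetic runs (assumption: made-up numbers, NOT taken from the paper).
compute = np.array([1e9, 1e10, 2e10, 1e11, 3e11, 1e12])
error = 2.0 * compute ** -0.1
error[2] *= 1.5  # two deliberately sub-optimal runs that sit off the frontier
error[4] *= 1.3

idx = pareto_frontier(compute, error)
alpha, beta = fit_power_law(compute[idx], error[idx])
print("frontier runs:", idx.tolist())
print(f"alpha={alpha:.3f}, beta={beta:.3f}")
```

Fitting only the frontier points keeps under-trained or otherwise sub-optimal runs from flattening the estimated slope. A negative alpha means error falls as compute grows.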

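Experimental Results

Research Questions

RQ1: Do power-law scaling laws still hold for contrastive language-image pre-training (CLIP) on public datasets?
RQ2: How do model size, data size, and number of samples seen interact to affect zero-shot classification and retrieval performance?
RQ3: Do OpenCLIP models trained on LAION show different scaling behavior from OpenAI CLIP models trained on WIT, and if so, why?
RQ4: How do scaling trends transfer to robustness benchmarks and to linear probing and fine-tuning settings?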

Key Findings

Model	Data	Architecture	ImageNet (Top-1, %)	VTAB+ (average, %)	MS-COCO retrieval R@5 (%)
OpenAI CLIP	WIT-400M	L/14	75.5	55.8	61.1
OpenCLIP (LAION)	LAION-2B	L/14	75.2	54.6	71.1
OpenCLIP (LAION)	LAION-2B	H/14	78.0	56.4	73.4
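
For readers unfamiliar with the R@5 column, the sketch below shows one common way to compute Recall@K from embeddings. It rests on stated assumptions: each query has exactly one correct gallery item at the same row index, and similarity is cosine similarity. The function name `recall_at_k` and the toy embeddings are hypothetical, and the paper's exact evaluation protocol is not reproduced here.

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, k=5):
    """Fraction of queries whose matching item (same row index) is in the top k."""
    # L2-normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sim = q @ g.T                                # shape: (num_queries, num_gallery)
    topk = np.argsort(-sim, axis=1)[:, :k]       # k most similar gallery items
    targets = np.arange(len(q))[:, None]         # correct index for each query
    return float((topk == targets).any(axis=1).mean())

# Toy data (assumption): queries are slightly perturbed copies of the gallery.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 32))
queries = gallery + 0.01 * rng.normal(size=(100, 32))
score = recall_at_k(queries, gallery, k=5)
print(f"R@5 = {score:.3f}")
```

With real CLIP embeddings, `query_emb` would hold text features and `gallery_emb` image features (or the reverse for the other retrieval direction).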

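Zero-shot performance (classification and retrieval) follows power-law scaling as model size, data size, and samples seen increase.
OpenCLIP trained on LAION-2B scales better on retrieval, whereas OpenAI CLIP trained on WIT scales better on zero-shot ImageNet classification.
Bottleneck effects are observed: scaling one dimension without the others limits the gains (for example, the trade-off between data size and samples seen).
Linear probing and fine-tuning also continue to benefit from scale: performance improves with larger data and models and with more samples seen.
Predictions based on the scaling laws indicate notable gains at larger scales, and robustness improves comparably with scale.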


This review was generated by AI and reviewed by a human editor.