QUICK REVIEW

[Paper Review] Demystifying CLIP Data

Xu Hu, Saining Xie|arXiv (Cornell University)|Sep 28, 2023

Multimodal Machine Learning Applications | Cited by 22

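One-Sentence Summary

The paper proposes MetaCLIP, a metadata-driven data curation method that reveals and improves on CLIP-style data collection by drawing a subset of raw web data that is balanced over the metadata distribution; without any modeling changes, it achieves higher zero-shot ImageNet accuracy than CLIP across multiple ViT scales.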

ABSTRACT

Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP only provides very limited information about its data and how it has been collected, leading to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and in our pursuit of making it open to the community introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining the same training budget, attains 72.4%. Our observations hold across various model sizes, exemplified by ViT-H achieving 80.5%, without any bells-and-whistles. Curation code and training data distribution on metadata is made available at https://github.com/facebookresearch/MetaCLIP.

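Motivation and Objectives

Reveal CLIP's data curation approach and its effect on model performance while keeping the architecture and training schedule fixed.
Propose MetaCLIP, a transparent, open-source data curation pipeline that balances data over metadata.
Quantify the gains of metadata-guided curation over raw web data across multiple model scales and data scales.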

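Proposed Method

Build a set of metadata entries from WordNet synsets and Wikipedia entries to approximate CLIP's query space.
Apply substring matching between the metadata entries and the texts of image-text pairs from a large data pool (CommonCrawl) to align pairs with entries.
Build an inverted index from entries to texts and analyze the distribution of matches to characterize the data.
Balance the data by capping the number of matched pairs per entry at a threshold t, which narrows the gap between head and tail entries and reduces noise.
Provide a simple, scalable algorithm (independent sampling) that curates the subset D from the raw pool D* using the metadata M and the threshold t, without storing an expensive inverted index.
Evaluate on ViT-B/32, ViT-B/16, ViT-L/14, and ViT-H/14 under a CLIP-style training budget.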
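
The matching and balancing steps above can be illustrated with a minimal Python sketch. It is not the released MetaCLIP code: the function names `match_entries` and `curate`, the toy pool, and the naive nested-loop substring search are assumptions made for illustration, and metadata entries are assumed to be lowercase. Each matched entry is given a keep probability of min(1, t / count), and a pair is kept if the draw succeeds for any of its matched entries, so pairs matching only head entries are down-sampled while pairs matching tail entries are always kept.

```python
import random
from collections import Counter


def match_entries(text, metadata):
    """Return indices of metadata entries that occur as substrings of text."""
    lowered = text.lower()
    return [i for i, entry in enumerate(metadata) if entry in lowered]


def curate(pool, metadata, t, seed=0):
    """Balance a pool of (image_id, text) pairs over metadata entries.

    pool     : list of (image_id, text) tuples (stand-in for the raw pool D*)
    metadata : list of lowercase entry strings (stand-in for M)
    t        : per-entry count threshold
    """
    rng = random.Random(seed)

    # Substring matching: which entries does each text contain?
    matches = [match_entries(text, metadata) for _, text in pool]

    # Number of texts matched by each entry across the whole pool.
    counts = Counter(i for matched in matches for i in matched)

    # Entries with at most t matches are always kept (probability 1.0);
    # entries with more matches are down-sampled with probability t / count.
    prob = {i: min(1.0, t / c) for i, c in counts.items()}

    curated = []
    for pair, matched in zip(pool, matches):
        # Unmatched pairs are dropped; a matched pair is kept if the
        # independent draw succeeds for any of its matched entries.
        if any(rng.random() < prob[i] for i in matched):
            curated.append(pair)
    return curated, counts


if __name__ == "__main__":
    pool = [
        ("a", "A photo of a dog"),
        ("b", "dog on grass"),
        ("c", "random noise"),
        ("d", "a cat"),
    ]
    curated, counts = curate(pool, ["dog", "cat"], t=1)
    print(counts)   # per-entry match counts
    print(curated)  # balanced subset
```

Because each pair is sampled independently from the per-entry counts, this sketch needs only the counts, not a stored inverted index from entries to texts, which mirrors the motivation given for the independent-sampling algorithm.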

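Experimental Results

Research Questions

RQ1: Can metadata-driven balancing improve the quality and distribution of vision-language pre-training data without changing the model or the training objective?
RQ2: How do metadata curation and balancing affect zero-shot performance on ImageNet and on broader benchmarks, across model scales and data scales?
RQ3: How do the data scale (400M, 1B, 2.5B) and the balancing threshold t affect downstream accuracy and data diversity?
RQ4: Under the same budget, how does MetaCLIP compare with CLIP and OpenCLIP when trained on web data?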

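Key Findings

With 400M pairs, MetaCLIP outperforms CLIP's WIT400M and LAION-400M on zero-shot ImageNet across ViT models (for example, 70.8% vs. CLIP's 68.3% on ViT-B, as reported in the abstract).
Balancing metadata counts at t = 20k outperforms unbalanced data and markedly reduces the dominance of head entries.
Scaling to 1B and 2.5B pairs under the same training budget maintains or improves ImageNet accuracy: ViT-L/14 reaches 79.0–79.4%, and ViT-H/14 reaches 80.5% (MetaCLIP 2.5B).
MetaCLIP achieves higher average accuracy than CLIP and OpenCLIP on ViT-B/32, ViT-B/16, and ViT-L/14.
Online balancing (in the data loader) yields similar gains, which suggests the approach is practical to deploy.
Ablations show that t is robust in roughly the 15k–35k range, with t = 20k often best at the 400M scale; unbalanced 1.6B data lowers ImageNet accuracy relative to the balanced setting.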


This review was generated by AI and reviewed by a human editor.