QUICK REVIEW

[Paper Review] SemDeDup: Data-efficient learning at web-scale through semantic deduplication

Amro Abbas, Kushal Tirumala|arXiv (Cornell University)|Mar 16, 2023

Data Quality and Management | Cited by 12

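One-sentence summary

SemDeDup uses embeddings from pre-trained models to identify and remove semantic duplicates in web-scale data, cutting dataset size by up to 50% with minimal performance loss and speeding up training on both vision-language and language-modeling tasks.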

ABSTRACT

Progress in machine learning has been driven in large part by massive increases in data. However, large web-scale datasets such as LAION are largely uncurated beyond searches for exact duplicates, potentially leaving much redundancy. Here, we introduce SemDeDup, a method which leverages embeddings from pre-trained models to identify and remove semantic duplicates: data pairs which are semantically similar, but not exactly identical. Removing semantic duplicates preserves performance and speeds up learning. Analyzing a subset of LAION, we show that SemDeDup can remove 50% of the data with minimal performance loss, effectively halving training time. Moreover, performance increases out of distribution. Also, analyzing language models trained on C4, a partially curated dataset, we show that SemDeDup improves over prior approaches while providing efficiency gains. SemDeDup provides an example of how simple ways of leveraging quality embeddings can be used to make models learn faster with less data.

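Motivation and objectives

Improve the data efficiency of large-scale self-supervised learning by addressing semantic redundancy that goes beyond exact duplicates.
Quantify how prevalent semantic duplicates are in web-scale datasets such as LAION.
Show that removing semantic duplicates can preserve or improve performance while reducing training time.
Extend semantic deduplication to large text corpora such as C4 and evaluate the efficiency gains for language modeling.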

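Proposed method

Embed each data point with a pre-trained foundation model (CLIP for images, OPT for language).
Cluster the embeddings into k clusters (for example, k=50,000 for CLIP and k=11,000 for OPT).
Within each cluster, compute pairwise cosine similarities and flag pairs whose similarity exceeds the threshold 1-ε as semantic duplicates.
From each group of duplicates, keep the sample with the lowest cosine similarity to the cluster centroid and remove the rest.
Tune ε to control the fraction of data retained, and analyze robustness to the choice of k and of the embedding model.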
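
The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`normalize`, `spherical_kmeans`, `semdedup`) are hypothetical, the toy k-means stands in for whatever clustering library would be used at scale, and the embeddings are assumed to be precomputed and non-zero (for example, taken from CLIP or OPT). The sketch also assumes one particular reading of the keep rule: within a cluster, samples are ranked by ascending similarity to the centroid, and a sample is dropped when its similarity to any sample ranked ahead of it exceeds 1-ε.

```python
import numpy as np


def normalize(x):
    """Scale each row to unit L2 norm so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def spherical_kmeans(x, k, iters=20, seed=0):
    """Toy k-means on unit vectors (assumption: stand-in for a scalable library)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmax(x @ centroids.T, axis=1)  # nearest centroid by cosine
        for j in range(k):
            members = x[labels == j]
            if len(members) == 0:
                continue  # leave an empty cluster's centroid unchanged
            mean = members.mean(axis=0)
            norm = np.linalg.norm(mean)
            if norm > 0:
                centroids[j] = mean / norm
    return np.argmax(x @ centroids.T, axis=1), centroids


def semdedup(embeddings, k, eps, seed=0):
    """Return the sorted indices of samples kept after semantic deduplication."""
    x = normalize(np.asarray(embeddings, dtype=np.float64))
    labels, centroids = spherical_kmeans(x, k, seed=seed)
    keep = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if idx.size == 0:
            continue
        # Rank members so the one least similar to the centroid comes first.
        idx = idx[np.argsort(x[idx] @ centroids[j], kind="stable")]
        sims = x[idx] @ x[idx].T  # pairwise cosine similarity within the cluster
        # ahead[a, b] is True when member a is ranked ahead of member b.
        ahead = np.triu(np.ones(sims.shape, dtype=bool), k=1)
        max_sim = np.where(ahead, sims, -np.inf).max(axis=0)
        # Keep a sample unless it is a semantic duplicate of one ranked ahead.
        keep.extend(idx[max_sim <= 1.0 - eps].tolist())
    return sorted(keep)


# Toy data: two near-duplicate pairs plus one unrelated vector.
toy = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 0.01, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.01],
    [0.0, 0.0, 1.0],
])
kept = semdedup(toy, k=1, eps=0.05)
print(kept)
```

Restricting the pairwise comparison to samples in the same cluster is what keeps the similarity matrices small; comparing every pair across a web-scale dataset would be far more expensive.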

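Experimental results

Research questions

RQ1: Measured in embedding space, how prevalent are semantic duplicates in web-scale datasets such as LAION?
RQ2: For CLIP and language models, does removing semantic duplicates preserve model performance while reducing dataset size and training time?
RQ3: How does SemDeDup behave across clustering granularities and embedding models, and how does it perform on out-of-distribution tasks?
RQ4: Does applying SemDeDup to a text corpus (C4) yield efficiency gains without sacrificing perplexity or validation performance?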

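Main findings

LAION-440M contains substantial semantic redundancy: 30% of samples have a semantic duplicate at ε=0.00095, and 50% at ε=0.03.
Removing up to 50% of LAION-440M as semantic duplicates has almost no effect on performance and roughly halves training time.
Across 24 tasks, average zero-shot performance improves after semantic duplicates are removed, with only minimal loss at larger pruning ratios.
On out-of-distribution tasks (6 datasets), SemDeDup outperforms the baseline when 37% of the data is removed and matches the baseline on average when 50% is removed.
For language modeling on C4, SemDeDup outperforms the NearDup baseline and yields meaningful compute savings by training on a smaller, deduplicated dataset.
When training continues on the deduplicated data for additional epochs, the model reaches baseline perplexity with 10 to 15% less compute.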
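
The duplicate fractions reported above depend on ε, so it can help to see how a threshold maps to a retained fraction. The sketch below is illustrative only. It assumes a hypothetical precomputed array `max_sim` that holds, for each sample, its highest cosine similarity to any same-cluster sample that would be preferred over it; the helper names `kept_fraction` and `eps_for_target` are likewise assumptions. Under that setup, the scores do not depend on ε, so sweeping ε reduces to a quantile lookup.

```python
import numpy as np


def kept_fraction(max_sim, eps):
    """Fraction of samples kept when a duplicate is any sample with max_sim > 1 - eps."""
    return float(np.mean(np.asarray(max_sim) <= 1.0 - eps))


def eps_for_target(max_sim, target_keep):
    """Pick eps so that roughly `target_keep` of the samples survive (ties aside)."""
    threshold = np.quantile(np.asarray(max_sim), target_keep)
    return float(1.0 - threshold)


# Hypothetical scores for ten samples, for illustration only.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99]
eps = eps_for_target(scores, target_keep=0.5)
print(eps, kept_fraction(scores, eps))
```

A larger ε lowers the similarity threshold, so more samples count as duplicates and fewer are kept.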

This review was generated by AI and reviewed by a human editor.