QUICK REVIEW

[Paper Review] DataComp-LM: In search of the next generation of training sets for language models

Jeffrey Li, Alex Chengyu Fang|arXiv (Cornell University)|Jun 17, 2024

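Natural Language Processing Techniques | Cited by 7

One-Sentence Summary

DataComp-LM (DCLM) introduces a standardized 240T-token corpus derived from Common Crawl, together with a benchmark framework for evaluating pretraining data curation for language models. Its results show that model-based filtering yields high-quality training sets, and that a 7B model trained on 2.6T tokens reaches state-of-the-art performance among open-data models.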

ABSTRACT

We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.

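Motivation and Objectives

Establish a controlled benchmark for curating language model training data, decoupling data quality from model architecture and training choices.
Provide a large-scale, standardized data pool (DCLM-Pool) and open tooling for reproducible filtering, deduplication, and mixing experiments across model scales (400M to 7B parameters).
Evaluate how data curation strategies (deduplication, filtering, data mixing) affect downstream performance under a shared evaluation suite.
Identify effective data curation practices that improve performance while reducing compute cost relative to prior open datasets.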

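Proposed Method

Create DCLM-Pool: an unfiltered 240T-token corpus derived from Common Crawl, with text extracted from HTML using resiliparse.
Define a multi-scale benchmark with five compute scales (400M-1x, 1B-1x, 1B-5x, 7B-1x, 7B-2x) and standardized OpenLM-based training recipes to isolate the effect of data.
Evaluate data curation pipelines along two tracks: filtering (selecting from the pool) and mixing (combining multiple sources).
Assess dataset quality with 53 downstream tasks (MMLU 5-shot, Core and Extended metrics, among others).
Study the components of dataset design (text extraction, deduplication, model-based quality filtering, mixing) through ablations, which together yield the DCLM-baseline dataset.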
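
The model-based quality filtering step can be pictured as "score every document, keep the best-scoring fraction". The sketch below is only an illustration under stated assumptions, not the DCLM implementation: `keep_top_fraction` and `toy_quality_score` are hypothetical names, and the toy scorer stands in for a trained classifier (the summary reports a fastText classifier with a top-10% threshold, which is why `keep_fraction` defaults to 0.10).

```python
from typing import Callable, List, Sequence


def keep_top_fraction(
    docs: Sequence[str],
    score_fn: Callable[[str], float],
    keep_fraction: float = 0.10,
) -> List[str]:
    """Return the highest-scoring `keep_fraction` of docs, in original order."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not docs:
        return []
    scores = [score_fn(d) for d in docs]
    # Always keep at least one document so tiny pools are not emptied.
    n_keep = max(1, int(len(docs) * keep_fraction))
    # Indices of the n_keep best-scoring documents.
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [d for i, d in enumerate(docs) if i in keep]


def toy_quality_score(text: str) -> float:
    """Hypothetical stand-in for a trained classifier's quality score."""
    words = text.split()
    if not words:
        return 0.0
    # Toy heuristic: share of distinct words, so repetitive text scores low.
    return len({w.lower() for w in words}) / len(words)


if __name__ == "__main__":
    pool = [
        "buy buy buy buy",
        "the cat sat on the mat",
        "explain why the sky appears blue",
    ]
    print(keep_top_fraction(pool, toy_quality_score, keep_fraction=0.34))
```

In a real pipeline, `score_fn` would wrap the classifier's predicted probability for the "high quality" label, and the threshold would be computed over the whole pool rather than a small in-memory list.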

Figure 1: Improving training sets leads to better models that are cheaper to train. Using DataComp-LM, we develop a high-quality dataset, DCLM-baseline, which we use to train models with state-of-the-art trade-off between compute and performance. We compare on both (left) a Core set of tasks and on

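Experimental Results

Research Questions

RQ1: Which data curation strategies (deduplication, filtering, mixing) yield the best downstream performance for baseline language models?
RQ2: How do text extraction methods and deduplication affect model performance across compute scales?
RQ3: To what extent does model-based filtering improve dataset quality compared with heuristic methods?
RQ4: Does mixing high-quality sources with Common Crawl-derived data help or hurt performance at different scales?
RQ5: Can a high-quality open dataset (DCLM-baseline) reach near state-of-the-art performance with a limited compute budget relative to models trained on private datasets?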

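Key Findings

Model-based filtering is the key component of effective data curation for DCLM-baseline.
fastText-based filtering, using OH-2.5 + ELI5 as positive examples and a top-10% threshold, performs well on both Core and MMLU.
A 7B model trained on 2.6T tokens of DCLM-baseline reaches 64% on MMLU (5-shot), outperforming several open-weight baselines trained with more compute.
At 64% MMLU, DCLM-baseline is comparable to Mistral-7B-v0.3 (63%) and Llama 3 8B (66%) while using about 6.6x less compute than Llama 3 8B.
Mixing high-quality sources with CC data improves some base datasets (such as C4 and RPJ-CC) but can hurt DCLM-baseline, suggesting that the effect of mixing depends on the quality of the base data.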
A 7B model trained on 2.6T tokens of DCLM-baseline reaches the state of the art among open-data models and is competitive with closed-data models of similar scale.
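
As a rough sanity check on the compute comparison, training cost can be estimated with the common rule of thumb of about 6 FLOPs per parameter per token. This is a back-of-the-envelope sketch, not the paper's accounting: the function name is hypothetical, the parameter counts are the nominal 7B and 8B, and the roughly 15T-token training budget for Llama 3 8B is an assumption that does not appear in the text above.

```python
def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


# Figures from the text: a 7B-parameter model trained on 2.6T tokens.
dclm_flops = approx_train_flops(7e9, 2.6e12)

# ASSUMPTION (not stated in the text): Llama 3 8B trained on roughly 15T tokens.
llama3_flops = approx_train_flops(8e9, 15e12)

ratio = llama3_flops / dclm_flops

print(f"DCLM-baseline 7B (approx.): {dclm_flops:.2e} FLOPs")
print(f"Llama 3 8B (assumed):       {llama3_flops:.2e} FLOPs")
print(f"Estimated compute ratio:    {ratio:.1f}x")
```

Under these assumptions the estimated ratio lands close to the reported 6.6x, which suggests the gap is driven almost entirely by the number of training tokens rather than by model size.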

Figure 2: The DCLM workflow. (A) A participant first chooses a scale, where larger scales reflect more training tokens or model parameters. (B) A participant then filters a pool of data (filtering track) or mixes data of their own (mixing track) to create a dataset. (C) Using the curated dataset, a


This review was generated by AI and reviewed by human editors.