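[Paper Explained] RedPajama: an Open Dataset for Training Large Language Models
The paper releases RedPajama-V1, an open reproduction of the LLaMA training data, and RedPajama-V2, a large-scale web dataset that ships with quality signals. Its goal is to support transparent, scalable open-source LLM development, and it uses ablation studies to show how quality signals can improve model performance.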
Large language models are increasingly becoming a cornerstone technology in artificial intelligence, the sciences, and society as a whole, yet the optimal strategies for dataset composition and filtering remain largely elusive. Many of the top-performing models lack transparency in their dataset curation and model development processes, posing an obstacle to the development of fully open language models. In this paper, we identify three core data-related challenges that must be addressed to advance open-source language models. These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis. To address these challenges, we release RedPajama-V1, an open reproduction of the LLaMA training dataset. In addition, we release RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata. Together, the RedPajama datasets comprise over 100 trillion tokens spanning multiple domains and with their quality signals facilitate the filtering of data, aiming to inspire the development of numerous new datasets. To date, these datasets have already been used in the training of strong language models used in production, such as Snowflake Arctic, Salesforce's XGen and AI2's OLMo. To provide insight into the quality of RedPajama, we present a series of analyses and ablation studies with decoder-only language models with up to 1.6B parameters. Our findings demonstrate how quality signals for web data can be effectively leveraged to curate high-quality subsets of the dataset, underscoring the potential of RedPajama to advance the development of transparent and high-performing language models at scale.
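Motivation and goals
- Make the case that open-source LLMs need transparent data curation and openly released datasets.
- Provide RedPajama-V1 as an open reproduction of the LLaMA training data, and RedPajama-V2 as a large-scale web dataset annotated with quality signals.
- Show how quality signals can be used to select higher-quality subsets of the data and improve model performance.
- Describe the RedPajama-INCITE models trained on the dataset and evaluate them against open baselines.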
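Proposed approach
- Reproduce the LLaMA training corpus to create RedPajama-V1, with detailed documentation of the processing steps.
- Create RedPajama-V2 by processing 84 Common Crawl snapshots (2014–2023) covering five languages and attaching 46 quality signals to each document.
- Release quality signals in several categories: natural language, repetitiveness, content-based, ML heuristics, and deduplication.
- Train the RedPajama-INCITE models (3B and 7B) on Summit, with custom engineering to work around architecture and FP16 constraints.
- Run ablations with decoder-only models (468M and 1.6B parameters) to evaluate the effect of quality signals on downstream NLP benchmarks.
- Compare RedPajama variants against open baselines on aggregate benchmark metrics.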
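To make the idea of filtering by per-document quality signals concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the signal names (`word_count`, `frac_duplicate_lines`, `ml_quality_score`), the thresholds, and the `Document` type are hypothetical and are not taken from the paper or from the RedPajama-V2 schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical container: the real dataset's schema and signal names may differ.
@dataclass
class Document:
    text: str
    signals: Dict[str, float]

Rule = Callable[[Dict[str, float]], bool]

# Illustrative rules with made-up thresholds, one per signal category.
RULES: Dict[str, Rule] = {
    "min_words": lambda s: s["word_count"] >= 50,                  # natural-language signal
    "low_repetition": lambda s: s["frac_duplicate_lines"] <= 0.3,  # repetitiveness signal
    "ml_score": lambda s: s["ml_quality_score"] >= 0.5,            # ML-heuristic signal
}

def filter_documents(docs: List[Document], rules: Dict[str, Rule]) -> List[Document]:
    """Keep only the documents that pass every rule."""
    return [d for d in docs if all(rule(d.signals) for rule in rules.values())]

# Toy documents with invented signal values.
docs = [
    Document("doc A", {"word_count": 120, "frac_duplicate_lines": 0.05, "ml_quality_score": 0.8}),
    Document("doc B", {"word_count": 12, "frac_duplicate_lines": 0.00, "ml_quality_score": 0.9}),
    Document("doc C", {"word_count": 200, "frac_duplicate_lines": 0.60, "ml_quality_score": 0.9}),
]

kept = filter_documents(docs, RULES)
retention = len(kept) / len(docs)  # fraction of documents that survive the filter
```

Because the signals are stored alongside the raw text, a different rule set can be tried by changing only `RULES`, without reprocessing the documents themselves.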
![Figure 1: The ecosystem around the RedPajama datasets. RedPajama has provided pretraining data for multiple open-source LLMs, including OpenELM [ 36 ] , OLMo [ 19 ] , Snowflake’s Arctic [ 54 ] and RedPajama-INCITE. SlimPajama is a cleaned and deduplicated version of RedPajama-V1.](https://ar5iv.labs.arxiv.org/html/2411.12372/assets/figures/rp-ecosystem-v2.2.png)
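Experimental results
Research questions
- RQ1: How can open-source LLM datasets be made more transparent and reproducible?
- RQ2: How do different quality signals, applied as filters to web-derived pretraining data, affect data quality and model performance?
- RQ3: Can a very large open web dataset (RPv2) support open LLMs that are competitive on standard benchmarks?
- RQ4: What are the trade-offs and practical considerations when reproducing a large-scale training corpus on commodity hardware or limited HPC resources?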
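Main findings
- RPv1 closely reproduces the LLaMA training corpus and provides a reproducible open baseline.
- RPv2 provides a large-scale web corpus with 46 quality signals per document, supporting principled filtering and ablation analysis.
- In ablations with 468M- and 1.6B-parameter models, the choice of quality signals has a substantial effect on downstream benchmark performance.
- RedPajama-INCITE models trained on Summit are competitive with open models of similar size in few-shot and zero-shot evaluation, and the instruction-tuned variants perform especially well on few-shot tasks.
- The ablation studies show how different quality-filtering rules affect average benchmark performance and perplexity.
- The metadata-rich design of RPv2 makes it quick to experiment with high-quality subsets of the data.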
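The two aggregate quantities mentioned above, perplexity and average benchmark performance, can be sketched as follows. This is a minimal illustration using the standard definition of perplexity (the exponential of the mean per-token negative log-likelihood) and an unweighted mean over tasks. The function names and the unweighted averaging are assumptions, not the paper's exact evaluation protocol, and no numbers here come from the paper.

```python
import math
from statistics import mean
from typing import Dict, Sequence

def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(mean(token_nlls))

def average_score(task_scores: Dict[str, float]) -> float:
    """Unweighted mean of per-task scores (an assumed aggregation scheme)."""
    return mean(task_scores.values())

# Sanity check: a model that assigns probability 1/4 to every token
# has a per-token NLL of log(4), so its perplexity is 4.
uniform_ppl = perplexity([math.log(4)] * 3)
```

Lower perplexity and higher average score both indicate a better model, so when comparing filtering rules it helps to report the two side by side rather than either one alone.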
![Figure 2: RedPajama-INCITE-Base 3B results on a subset of lm-evaluation-harness. The tasks were selected according to the selection made to evaluate Pythia [ 4 ] and GPT-J [ 59 ]](https://ar5iv.labs.arxiv.org/html/2411.12372/assets/figures/rp_incite.png)
This explainer was generated by AI and reviewed by a human editor.