QUICK REVIEW

[Paper Digest] A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity

Shayne Longpre, Gregory Yauney|arXiv (Cornell University)|May 22, 2023

Topic Modeling | Cited by 7

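One-Sentence Summary

This study pretrains 28 models to quantify how data age, quality/toxicity filtering, and domain composition affect language model performance, finding that no single filtering strategy fits all cases and that heterogeneous data sources are valuable.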

ABSTRACT

Pretraining is the preliminary and fundamental step in developing capable language models (LM). Despite this, pretraining data design is critically under-documented and often guided by empirically unsupported intuitions. To address this, we pretrain 28 1.5B parameter decoder-only models, training on data curated (1) at different times, (2) with varying toxicity and quality filters, and (3) with different domain compositions. First, we quantify the effect of pretraining data age. A temporal shift between evaluation data and pretraining data leads to performance degradation, which is not overcome by finetuning. Second, we explore the effect of quality and toxicity filters, showing a trade-off between performance on standard benchmarks and risk of toxic generations. Our findings indicate there does not exist a one-size-fits-all solution to filtering training data. We also find that the effects of different types of filtering are not predictable from text domain characteristics. Lastly, we empirically validate that the inclusion of heterogeneous data sources, like books and web, is broadly beneficial and warrants greater prioritization. These findings constitute the largest set of experiments to validate, quantify, and expose many undocumented intuitions about text pretraining, which we hope will help support more informed data-centric decisions in LM development.

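Research Motivation and Goals

Measure how the age of pretraining data affects downstream performance and finetuning outcomes.
Assess how quality and toxicity filtering change model behavior and task performance.
Assess how domain composition (books, web, etc.) affects generalization and toxic generation.
Offer practical recommendations for data curation in language model pretraining.
Validate the findings across a large suite of 28 1.5B-parameter models, exposing undocumented intuitions about text pretraining.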

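Proposed Method

Pretrain 28 decoder-only models (LM-XL, 1.5B parameters) on datasets modified along one of three axes: collection time, toxicity/quality filtering, or domain composition.
Start from C4 and the Pile, and apply a range of filters (quality thresholds, toxicity thresholds, inverse filters).
Deduplicate the datasets and compare against unfiltered baseline datasets.
Evaluate downstream performance on QA, toxicity identification, and toxic generation tasks, using benchmarks that vary over time and across domains.
Analyze observed data characteristics (PII, readability, length, etc.) to provide context for the filtering effects.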
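
The filtering steps listed above can be illustrated with a small sketch. This is not the authors' pipeline: the `Doc` type, the per-document `quality` and `toxicity` scores, the thresholds, and the exact-match deduplication are all assumptions made for illustration, since this digest does not specify the classifiers or the deduplication method used. It only shows how a quality threshold, a toxicity threshold, and an inverse toxicity filter differ in which documents they keep.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Doc:
    text: str
    quality: float   # hypothetical classifier score in [0, 1]; higher = higher quality
    toxicity: float  # hypothetical classifier score in [0, 1]; higher = more toxic


def dedup_exact(docs: List[Doc]) -> List[Doc]:
    """Drop documents whose whitespace- and case-normalized text was already seen.

    Exact-match dedup is an assumption; the method used in the paper is not stated here.
    """
    seen = set()
    kept = []
    for doc in docs:
        key = " ".join(doc.text.split()).lower()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


def quality_filter(docs: List[Doc], threshold: float) -> List[Doc]:
    """Keep documents scoring at or above the quality threshold."""
    return [d for d in docs if d.quality >= threshold]


def toxicity_filter(docs: List[Doc], threshold: float, inverse: bool = False) -> List[Doc]:
    """Keep documents below the toxicity threshold.

    With inverse=True, keep the documents at or above the threshold instead.
    """
    if inverse:
        return [d for d in docs if d.toxicity >= threshold]
    return [d for d in docs if d.toxicity < threshold]


# Toy corpus with made-up scores, for illustration only.
corpus = [
    Doc("A clean, well written paragraph.", quality=0.9, toxicity=0.05),
    Doc("a clean,  well written paragraph.", quality=0.9, toxicity=0.05),  # near-verbatim duplicate
    Doc("low quality spam spam spam", quality=0.1, toxicity=0.10),
    Doc("An insulting but fluent rant.", quality=0.8, toxicity=0.95),
]

deduped = dedup_exact(corpus)
high_quality = quality_filter(deduped, threshold=0.5)
non_toxic = toxicity_filter(deduped, threshold=0.5)
toxic_only = toxicity_filter(deduped, threshold=0.5, inverse=True)
```

In this toy corpus the fluent but insulting document passes the quality filter and is removed by the toxicity filter, which is one way the two filters can disagree about the same text.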

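Experimental Results

Research Questions

RQ1: How does the age of pretraining data affect downstream model performance and the effectiveness of finetuning?
RQ2: How do quality and toxicity filtering trade off model performance against toxicity-related behavior?
RQ3: How does the domain composition of pretraining data affect generalization and toxic generation?
RQ4: Can the effects of filtering be predicted from high-level text domain characteristics?
RQ5: Does including heterogeneous data sources (books, web) yield consistent gains on downstream tasks?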

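Key Findings

A temporal mismatch between pretraining data and evaluation data degrades performance, and the effect is more pronounced for larger models.
Quality filtering improves downstream task performance even though it reduces the amount of data, whereas toxicity filtering can reduce generalization and QA performance.
Toxicity and quality do not always align: highly toxic content can carry stronger quality signals, and domain characteristics alone do not predict filtering outcomes.
Including heterogeneous data sources such as books and web text generally improves performance, with books contributing relatively more to toxicity.
The effects of data age and domain mix on model performance follow no one-size-fits-all pattern, underscoring the need for nuanced data curation strategies.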


This digest was generated by AI and reviewed by a human editor.
