QUICK REVIEW

[Paper Review] QuRating: Selecting High-Quality Data for Training Language Models

Alexander Wettig, Aatmik Gupta|arXiv (Cornell University)|Feb 15, 2024

Natural Language Processing Techniques | Cited by 6

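One-Sentence Summary

QuRating learns scalar text-quality ratings with a QuRater model trained on pairwise comparisons judged by an LLM, covering four criteria chosen to reflect human intuitions about quality; sampling data according to these ratings and training 1.3B-parameter language models on 30B tokens improves in-context learning and perplexity over baselines.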

ABSTRACT

Selecting high-quality pre-training data is important for creating capable language models, but existing methods rely on simple heuristics. We introduce QuRating, a method for selecting pre-training data that can capture human intuitions about data quality. In this paper, we investigate four qualities - writing style, required expertise, facts & trivia, and educational value - and find that LLMs are able to discern these qualities, especially when making pairwise judgments of texts. We train a QuRater model to learn scalar ratings from pairwise judgments, and use it to annotate a 260B training corpus with quality ratings for each of the four criteria. In our experiments, we select 30B tokens according to the different quality ratings and train 1.3B-parameter language models on the selected data. We find that it is important to balance quality and diversity. When we sample using quality ratings as logits over documents, our models obtain lower perplexity and stronger in-context learning performance than baselines. Our best model is based on educational value and performs similarly to a model trained with uniform sampling for 50% more steps. Beyond data selection, we use the quality ratings to construct a training curriculum which improves performance without changing the training dataset. We extensively analyze the quality ratings and discuss their characteristics, biases, and wider implications.

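Motivation and Goals

Capture abstract, human-perceived notions of text quality to guide data selection for language model pre-training.
Collect pairwise judgments from an LLM and train a QuRater model to learn scalar quality ratings from them.
Annotate a large corpus with four quality criteria: writing style, facts & trivia, educational value, and required expertise.
Evaluate how quality-based data selection and curricula affect model performance across tasks.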

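Proposed Method

Formalize each quality criterion as a pairwise comparison between texts.
Collect pairwise judgments from an LLM and convert them into scalar ratings with the Bradley-Terry model.
Train a 1.3B-parameter QuRater (based on Sheared-Llama) with a separate prediction head per criterion, so that it predicts all four quality ratings from the input text.
Annotate a 260B-token corpus derived from SlimPajama, producing QuRatedPajama, in which every document carries ratings for all four criteria.
Sample 30B tokens from QuRatedPajama according to the ratings, using a temperature (tau) to balance quality against diversity.
Train 1.3B-parameter language models on the selected data and evaluate them by perplexity and by in-context learning on 10 tasks; also explore training curricula that order the data by quality rating.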
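
The Bradley-Terry step above can be illustrated with a small sketch. It models the probability that text B is preferred over text A as sigmoid(s_B - s_A) and fits the scores by gradient descent on the cross-entropy against soft preference labels. This is a toy under stated assumptions, not the paper's implementation: it fits one free scalar per document instead of training a neural QuRater, and the function name `fit_bradley_terry`, the learning rate, and the step count are all hypothetical choices made for illustration.

```python
import math


def sigmoid(x):
    """Logistic function, used as the Bradley-Terry preference probability."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_bradley_terry(num_items, judgments, lr=0.5, steps=2000):
    """Fit one scalar score per item from soft pairwise judgments.

    judgments: list of (a, b, p_b) where p_b is the fraction of judgments
    that preferred item b over item a (a soft label in [0, 1]).
    Returns a list of zero-centered scores (hypothetical toy setup).
    """
    scores = [0.0] * num_items
    for _ in range(steps):
        grads = [0.0] * num_items
        for a, b, p_b in judgments:
            pred = sigmoid(scores[b] - scores[a])
            # Gradient of the cross-entropy w.r.t. (s_b - s_a) is pred - p_b.
            g = pred - p_b
            grads[b] += g
            grads[a] -= g
        scores = [s - lr * g / len(judgments) for s, g in zip(scores, grads)]
        # Scores are only identified up to a constant shift, so center them.
        mean = sum(scores) / num_items
        scores = [s - mean for s in scores]
    return scores


# Toy example: item 2 is usually preferred over 1, and 1 over 0.
toy_judgments = [(0, 1, 0.9), (1, 2, 0.8), (0, 2, 0.95)]
toy_scores = fit_bradley_terry(3, toy_judgments)
```

In this toy setup the fitted scores increase from item 0 to item 2, matching the direction of the preferences. A learned rater would replace the free per-document scores with the output of a model applied to the text, which lets it rate documents that never appeared in a comparison.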

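Experimental Results

Research Questions

RQ1 Can pairwise judgments from an LLM reliably capture abstract text qualities across different criteria?
RQ2 When used for data selection, how does each of the four quality criteria relate to downstream language model performance?
RQ3 Does sampling with quality ratings as logits (scaled by a temperature) outperform uniform sampling and perplexity-based filtering?
RQ4 Can educational value, writing style, facts & trivia, and required expertise guide the design of effective curricula for language model training?
RQ5 What biases or limitations arise when QuRating is applied to diverse domains and social content?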

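Key Findings

Pairwise judgments give a more stable quality signal than rating texts directly.
With tau = 2.0, selecting by educational value consistently improves in-context learning across all tasks.
Facts & trivia and writing style help on some tasks but are not uniformly better for in-context learning; writing style yields the largest perplexity improvement.
Selecting only the top-rated documents can hurt performance; balancing quality and diversity through the temperature is beneficial.
A curriculum that orders the data by quality rating (for example, in increasing order of required expertise) can improve performance without changing the data pool.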
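
The contrast between top-rated selection and temperature-based sampling can be sketched as follows. The abstract describes sampling "using quality ratings as logits over documents"; the sketch draws documents without replacement with probability proportional to exp(rating / tau) by adding Gumbel noise to the scaled ratings and taking the largest keys (the Gumbel top-k trick). Whether the authors implement the sampling this way is an assumption, and `select_documents`, its parameters, and the convention that tau = 0 means pure top-k selection are hypothetical names and choices for this illustration.

```python
import math
import random


def select_documents(ratings, token_counts, token_budget, tau, seed=0):
    """Pick document indices until the token budget is reached.

    tau == 0 : deterministic selection of the highest-rated documents.
    tau  > 0 : sampling without replacement with probability proportional
               to exp(rating / tau), via the Gumbel top-k trick.
    (Hypothetical helper; not taken from the paper's code.)
    """
    rng = random.Random(seed)
    keys = []
    for rating in ratings:
        if tau > 0:
            # Gumbel(0, 1) noise; the max() guards against log(0).
            u = max(rng.random(), 1e-300)
            keys.append(rating / tau - math.log(-math.log(u)))
        else:
            keys.append(rating)

    # Visit documents from the largest key to the smallest.
    order = sorted(range(len(ratings)), key=lambda i: keys[i], reverse=True)

    selected, total = [], 0
    for i in order:
        if total >= token_budget:
            break
        selected.append(i)
        total += token_counts[i]
    return selected


# Toy corpus: four documents of 10 tokens each, budget of 20 tokens.
toy_ratings = [0.0, 1.0, 2.0, 3.0]
toy_counts = [10, 10, 10, 10]
top_only = select_documents(toy_ratings, toy_counts, 20, tau=0.0)
sampled = select_documents(toy_ratings, toy_counts, 20, tau=2.0)
```

With tau = 0 the toy call always returns the two highest-rated documents, while a positive tau gives lower-rated documents a nonzero chance of being selected. A larger tau flattens the distribution toward uniform sampling, which is the quality-versus-diversity trade-off noted in the findings above.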


This review was generated by AI and checked by a human editor.