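[Paper Digest] Scaling Laws and Interpretability of Learning from Repeated Data
This paper studies how a small amount of repeated data in large language model training causes a strong double-descent degradation, and ties that to mechanistic interpretability by showing that repetition disproportionately damages copying and the structures associated with induction heads.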
Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the model is exposed to repeated data at the sentence, paragraph, or document level. Some works have reported substantial negative performance effects of this repeated data. In this paper we attempt to study repeated data systematically and to understand its effects mechanistically. To do this, we train a family of models where most of the data is unique but a small fraction of it is repeated many times. We find a strong double descent phenomenon, in which repeated data can lead test loss to increase midway through training. A predictable range of repetition frequency leads to surprisingly severe degradation in performance. For instance, performance of an 800M parameter model can be degraded to that of a 2x smaller model (400M params) by repeating 0.1% of the data 100 times, despite the other 90% of the training tokens remaining unique. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model's capacity, and this may be where the peak of degradation occurs. Finally, we connect these observations to recent mechanistic interpretability work - attempting to reverse engineer the detailed computations performed by the model - by showing that data repetition disproportionately damages copying and internal structures associated with generalization, such as induction heads, providing a possible mechanism for the shift from generalization to memorization. Taken together, these results provide a hypothesis for why repeating a relatively small fraction of data in large language models could lead to disproportionately large harms to performance.
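The abstract's headline example involves a small piece of arithmetic that is easy to miss: a subset equal to 0.1% of the token budget, seen 100 times, occupies 10% of all training tokens, which is why the remaining 90% is unique. The sketch below spells that out. The function name `repeated_token_fraction` is hypothetical, and it assumes the 0.1% is measured against the total token budget, which is one reading of the abstract.

```python
def repeated_token_fraction(subset_fraction: float, num_repeats: int) -> float:
    """Share of all training tokens that come from the repeated subset.

    subset_fraction: size of the repeated subset relative to the total
        token budget (assumed interpretation of "0.1% of the data").
    num_repeats: how many times the subset is seen during training.
    """
    return subset_fraction * num_repeats


# The abstract's example: 0.1% of the data, repeated 100 times.
frac = repeated_token_fraction(0.001, 100)
print(f"repeated tokens: {frac:.0%}, unique tokens: {1 - frac:.0%}")
```

Under that reading, the repeated subset is tiny, but the model spends a tenth of its training steps on it.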
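Research Motivation and Goals
- Study the effect of repeated data on language model performance within a scaling-laws framework.
- Characterize the double descent phenomenon that data repetition induces across model sizes and repetition frequencies.
- Examine mechanistic interpretability evidence, in particular induction heads and copying, to explain the performance degradation.
- Assess how pretraining on repeated data affects subsequent fine-tuning and generalization.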
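Proposed Method
- Train Transformer language models on data that is mostly unique, with a small subset repeated many times, spanning 2–3 orders of magnitude in size and repetition frequency.
- Train for 100B tokens while varying the model size, the size of the repeated subset, and the fraction of training tokens drawn from the repeated data.
- Evaluate with test loss, copying-focused tasks (such as copying passages from Harry Potter), and mechanistic probes such as prefix matching and induction heads.
- Analyze the scaling-law behavior and identify the region where repetition causes peak degradation, consistent with double descent.
- Use small attention-only models to examine induction heads and copying at the circuit level, linking the observed behavior to mechanisms.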
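The data setup described above can be sketched as a toy mixture builder. This is a minimal illustration under stated assumptions, not the authors' pipeline: `build_mixture` is a hypothetical helper, documents stand in for token sequences, and repeated copies are assumed to be shuffled uniformly among the unique data.

```python
import random


def build_mixture(unique_docs, repeated_docs, num_repeats, seed=0):
    """Return a shuffled training order.

    Every unique document appears once; every document in the repeated
    subset appears num_repeats times. Uniform shuffling is an assumption.
    """
    order = list(unique_docs)
    order += [doc for doc in repeated_docs for _ in range(num_repeats)]
    rng = random.Random(seed)  # seeded so the order is reproducible
    rng.shuffle(order)
    return order


# Toy numbers mirroring the abstract's example: 1 repeated document
# (0.1% of a 1000-document budget) seen 100 times, plus 900 unique ones.
unique = [f"u{i}" for i in range(900)]
repeated = ["r0"]
mix = build_mixture(unique, repeated, num_repeats=100)

repeated_share = mix.count("r0") / len(mix)
print(len(mix), repeated_share)
```

Sweeping `len(repeated)` and `num_repeats` independently is what lets the repeated-subset size and the fraction of training tokens spent on it vary separately.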
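Experimental Results
Research Questions
- RQ1: Does a small amount of repeated data cause disproportionate degradation in language model performance across model sizes and repetition frequencies?
- RQ2: How does repeated data affect copying and in-context learning mechanisms such as induction heads?
- RQ3: Can the degradation pattern be explained by double descent dynamics, and how does that relate to scaling laws?
- RQ4: How does pretraining on repeated data affect performance after subsequent fine-tuning?
- RQ5: Do mechanistic interpretability probes (copying, prefix matching, induction heads) reveal a causal link between repetition and memorization behavior?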
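Key Findings
- Repeated data induces a strong double descent phenomenon, with degradation peaking at intermediate repetition frequencies.
- For an 800M-parameter model, repeating 0.1% of the data 100 times degrades performance to that of a 400M-parameter model.
- The degradation peak coincides with the training loss on the repeated data approaching zero, indicating that the repeated subset has been memorized.
- Repeated data harms copying and induction-head-related structures far more than it harms overall test loss.
- At the degradation peak, the effective model size on copying tasks drops by up to 3x, even though test loss degrades by less.
- Induction heads and prefix matching degrade markedly under repetition, linking memorization to mechanistic changes inside the model.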
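One way to read "effective model size" in the findings above is: the size of a model trained only on unique data that would reach the same loss. The sketch below estimates it by interpolating a baseline loss curve in log-parameter space. The interpolation scheme, the helper name `effective_params`, and every number in `baseline` are assumptions made for illustration; none of them are taken from the paper.

```python
import math


def effective_params(loss, baseline):
    """Baseline model size that would reach `loss`, by log-linear interpolation.

    baseline: (param_count, test_loss) pairs for models trained on unique
        data only, with loss decreasing as param_count grows.
    """
    pts = sorted(baseline)
    for (n_lo, l_lo), (n_hi, l_hi) in zip(pts, pts[1:]):
        if l_hi <= loss <= l_lo:
            t = (l_lo - loss) / (l_lo - l_hi)
            log_n = math.log(n_lo) + t * (math.log(n_hi) - math.log(n_lo))
            return math.exp(log_n)
    raise ValueError("loss is outside the range covered by the baseline")


# Hypothetical baseline curve, for illustration only (not from the paper).
baseline = [(100e6, 3.6), (200e6, 3.4), (400e6, 3.2), (800e6, 3.0)]

# Suppose an 800M model trained with repeated data ends at loss 3.2.
n_eff = effective_params(3.2, baseline)
multiplier = n_eff / 800e6
print(f"effective size: {n_eff / 1e6:.0f}M, multiplier: {multiplier:.2f}")
```

With these made-up numbers the multiplier comes out at about 0.5, the same shape as the 800M to 400M example; running the same calculation on a copying-task loss rather than test loss is how a larger drop, such as 3x, would show up.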
This digest was generated by AI and reviewed by a human editor.