QUICK REVIEW

[Paper Review] A Note on Normalized Emergence Timing (in Pythia Language Model Evaluations)

Rooks, Tyler Cason|arXiv (Cornell University)|Apr 3, 2023

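Topic Modeling | Cited by 163

One-Sentence Summary

This paper introduces Pythia, a publicly available suite of LLMs trained on data in the same order, and analyzes training dynamics and scaling effects, with case studies on bias, memorization, and the influence of term frequency.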

ABSTRACT

How do large language models (LLMs) develop and evolve over the course of training? How do these patterns change as models scale? To answer these questions, we introduce Pythia, a suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. We provide public access to 154 checkpoints for each one of the 16 models, alongside tools to download and reconstruct their exact training dataloaders for further study. We intend Pythia to facilitate research in many areas, and we present several case studies including novel results in memorization, term frequency effects on few-shot performance, and reducing gender bias. We demonstrate that this highly controlled setup can be used to yield novel insights toward LLMs and their training dynamics. Trained models, analysis code, training code, and training data can be found at https://github.com/EleutherAI/pythia.

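Research Motivation and Goals

Advance scientific research on large language models by providing a standardized, publicly accessible model suite.
Investigate how training data order, deduplication, and model scale affect learning dynamics and bias.
Examine how pretraining term frequency affects downstream task performance as training progresses.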

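Proposed Method

Provide a suite spanning 8 model sizes (70M to 12B parameters), all trained on data in the same order, with public checkpoints (154 per model).
Train two copies of the suite, one on the Pile and one on the deduplicated Pile, to study data effects (8 sizes x 2 datasets = 16 models).
Use dense parallel attention and rotary embeddings, with untied embedding matrices, to improve interpretability and efficiency.
Train with a large batch size (1024) on the GPT-NeoX framework, using ZeRO, data/tensor parallelism, and Flash Attention for scalability.
Evaluate on eight benchmarks with the Language Model Evaluation Harness for comparison against OPT and BLOOM baselines.
Release all models, checkpoints, and evaluation code under the Apache 2.0 license for full reproducibility.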
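
The figure of 154 checkpoints per model can be sanity-checked with a short sketch. The schedule used here (a checkpoint at step 0, at powers of two up to step 512, and then every 1,000 steps through step 143,000) is an assumption drawn from how the Pythia repository is commonly described; it is not stated in this review, and the function name and defaults are illustrative only.

```python
def checkpoint_steps(total_steps=143_000, interval=1_000, max_log_step=512):
    """Return the training steps at which a checkpoint is ASSUMED to be saved.

    Assumed schedule (not confirmed by this review):
      - step 0 (initialization)
      - log-spaced steps 1, 2, 4, ..., max_log_step
      - every `interval` steps from `interval` through `total_steps`
    """
    steps = [0]
    step = 1
    while step <= max_log_step:
        steps.append(step)
        step *= 2
    steps.extend(range(interval, total_steps + 1, interval))
    return steps


if __name__ == "__main__":
    steps = checkpoint_steps()
    # 1 initial + 10 log-spaced + 143 evenly spaced checkpoints
    print(len(steps))
```

Under these assumptions the count matches the 154 checkpoints reported for each model, which makes the schedule plausible without confirming it.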

Figure 1 : The CrowS-Pairs gender bias, shown as the percentage of times that the perplexity of the stereotyping sentence is lower than its less stereotyped counterpart (% Stereotype) for the Pythia models of different sizes at the end of training. We also show the effect of the gender swapping inte
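
The "% Stereotype" metric in the Figure 1 caption can be illustrated with a minimal sketch. It assumes that a perplexity has already been computed for each sentence in a pair; the function name and the example values are hypothetical and do not come from the paper.

```python
def percent_stereotype(pairs):
    """Percentage of pairs where the stereotyping sentence has LOWER perplexity.

    `pairs` is an iterable of (ppl_stereotyped, ppl_less_stereotyped) tuples.
    A value near 50 means the model shows no preference on this measure.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one sentence pair is required")
    preferred = sum(1 for ppl_stereo, ppl_other in pairs if ppl_stereo < ppl_other)
    return 100.0 * preferred / len(pairs)


if __name__ == "__main__":
    # Made-up perplexities, for illustration only.
    example = [(12.0, 15.0), (20.0, 18.0), (9.5, 11.0), (30.0, 30.5)]
    print(percent_stereotype(example))  # 3 of 4 pairs -> 75.0
```

Ties are counted as not preferring the stereotyping sentence here; that is a design choice of this sketch, not something the caption specifies.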

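Experimental Results

Research Questions

RQ1: How do training data order and deduplication affect performance and memorization across model scales?
RQ2: How strongly does pretraining term frequency influence task performance over the course of training?
RQ3: How does the architectural choice of parallel attention and MLP layers affect performance in small versus large models?
RQ4: How does a gender bias intervention that modifies pronoun frequencies affect downstream bias measurements across model scales?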

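Key Findings

Deduplication yields no significant language modeling improvement for the Pythia models.
Parallel attention + MLP achieves equivalent performance at every scale, contrary to some earlier claims.
For BLOOM, the "curse of multilinguality" is small, inconsistent, and benchmark-dependent, which suggests re-evaluating it on a more diverse set of tasks.
A Poisson point process models the timing of memorization well, indicating that training order has limited influence on which sequences are memorized.
At around 65,000 training steps (about 45% of training), the larger models (2.8B and up) begin to show a correlation between task accuracy and pretraining term frequency, a marked phase change.
Pronoun-frequency interventions applied during the last 7% or 21% of training reduce gender bias on specific benchmarks without a major effect on perplexity on the baseline tasks.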
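
One simple way to probe the Poisson claim above is an index-of-dispersion check: under a homogeneous Poisson process, the number of events in equal-sized windows has variance roughly equal to its mean, so the ratio is close to 1. The sketch below is a generic diagnostic under that assumption, not the analysis used in the paper; the function names and the toy input are hypothetical.

```python
import statistics


def window_counts(event_steps, total_steps, window):
    """Count events (e.g. steps at which a memorized sequence was seen) per window."""
    n_windows = total_steps // window
    counts = [0] * n_windows
    for step in event_steps:
        idx = step // window
        if 0 <= idx < n_windows:  # ignore events outside the covered range
            counts[idx] += 1
    return counts


def dispersion_index(counts):
    """Sample variance divided by mean; close to 1 is consistent with Poisson counts."""
    mean = statistics.mean(counts)
    if mean == 0:
        raise ValueError("no events observed")
    return statistics.variance(counts) / mean


if __name__ == "__main__":
    # Toy event positions, for illustration only.
    counts = window_counts([0, 1, 2, 10, 11, 25], total_steps=30, window=10)
    print(counts)                    # [3, 2, 1]
    print(dispersion_index(counts))  # 0.5
```

A ratio well above 1 would suggest that memorized sequences cluster at particular points in training, while a ratio well below 1 would suggest they are more evenly spread than chance. With only three windows, as in the toy input, the estimate is far too noisy to interpret.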

Figure 2 : The WinoBias gender bias results, shown as the proportion of the time that the model placed a higher log probability on the more stereotyped pronoun as an answer to a multiple choice gender–occupation co-reference question.


This review was generated by AI and checked by a human editor.