QUICK REVIEW

[Paper Review] Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Jack W. Rae, Sebastian Borgeaud|arXiv (Cornell University)|Dec 8, 2021

Topic Modeling | Cited by 242

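One-Sentence Summary

This paper analyzes the scaling of Transformer language models up to 280B parameters (Gopher), trained on MassiveText and evaluated on 152 tasks, and examines the toxicity, bias, and safety implications of large-scale models.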

ABSTRACT

Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.

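Research Motivation and Objectives

Motivate the building of large language models by exploring how scale affects performance across different tasks.
Describe the dataset, architecture, training schedule, and infrastructure used to train Gopher and its model family.
Describe the performance gains that scaling brings in reading, knowledge, and science domains.
Study toxicity, bias, and safety considerations as model scale increases, along with the impact on downstream harms and the design of mitigations.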

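Proposed Method

Use Transformer-based autoregressive models with RMSNorm and relative positional encodings.
Train six models, ranging from 44M to 280B parameters, on 300B tokens with a 2048-token context window.
Train on MassiveText, a curated multi-source English dataset with quality filtering and deduplication applied.
Evaluate on 152 tasks covering language modelling, reading comprehension, fact-checking, question answering, common sense, MMLU, and BIG-bench.
Analyze toxicity using RealToxicityPrompts prompts and the Perspective API, and evaluate bias and dialect representation.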
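
The RMSNorm layer named in the method list can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `rms_norm`, the `eps` default, and the plain-list representation are assumptions chosen to keep the example self-contained. It shows only the core operation, which divides a vector by its root-mean-square and then applies a learned per-dimension gain.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Normalize x by its root-mean-square, then scale by a per-dimension gain.

    x and gain are plain lists of floats of equal length. eps is an assumed
    small constant that guards against division by zero.
    """
    if len(x) != len(gain):
        raise ValueError("x and gain must have the same length")
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * inv_rms for v, g in zip(x, gain)]


# With a gain of all ones, the output has a root-mean-square close to 1.
normalized = rms_norm([3.0, 4.0], [1.0, 1.0])
```

Unlike LayerNorm, this operation does not subtract the mean or add a bias term, so it has fewer operations per activation.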

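Experimental Results

Research Questions

RQ1: How does model scale (parameters and compute) affect performance across a broad range of NLP tasks?
RQ2: Which task categories benefit most from scaling, and where does scale have limited effect, such as in mathematics and reasoning?
RQ3: How does larger scale affect toxic generation and toxicity classification ability?
RQ4: What are the safety and bias implications of deploying very large language models such as Gopher, and how can mitigations be designed?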

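Key Findings

Gopher (280B) outperforms the prior state of the art on roughly 81% of the tasks that have a comparable prior result, drawn from the 152-task evaluation suite.
Scale brings significant gains on knowledge-intensive tasks (such as reading comprehension and fact-checking) and on general knowledge, with smaller gains in mathematical and logical reasoning.
On the RACE reading comprehension benchmark, Gopher approaches human performance on the high-school-level tasks and surpasses GPT-3 on the middle-school tasks.
Larger models improve toxicity detection but produce more toxic output when given toxic prompts, highlighting the nuanced nature of the safety trade-off.
Compared with the smaller models in the Gopher family, performance improves on most tasks, with especially notable gains in medicine, science, technology, social sciences, and the humanities; some reasoning tasks benefit little from scaling.
Relative to SOTA baselines, Gopher approaches or exceeds SOTA on many benchmarks, but still falls short of human expert performance in complex domains.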
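
The toxicity finding above relates continuation toxicity to prompt toxicity. A minimal sketch of that kind of comparison follows, under loud assumptions: the scores are made up for illustration, no call to the Perspective API is made, and the function name `mean_continuation_toxicity` and the bucket edges are hypothetical choices, not the paper's protocol. It groups (prompt score, continuation score) pairs by prompt-toxicity bucket and averages the continuation scores in each bucket.

```python
from collections import defaultdict


def mean_continuation_toxicity(pairs, bucket_edges=(0.25, 0.5, 0.75)):
    """Return the mean continuation score per prompt-toxicity bucket.

    pairs is an iterable of (prompt_score, continuation_score), each in [0, 1].
    bucket_edges are assumed thresholds; bucket 0 holds the least toxic prompts.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for prompt_score, continuation_score in pairs:
        # Bucket index = number of edges the prompt score meets or exceeds.
        bucket = sum(prompt_score >= edge for edge in bucket_edges)
        sums[bucket] += continuation_score
        counts[bucket] += 1
    return {b: sums[b] / counts[b] for b in sorted(counts)}


# Made-up scores, for illustration only; these are not results from the paper.
example_pairs = [(0.1, 0.05), (0.2, 0.15), (0.6, 0.30), (0.9, 0.70), (0.8, 0.50)]
per_bucket = mean_continuation_toxicity(example_pairs)
```

Running the same aggregation for each model size would let the per-bucket means be compared across scales, which is the shape of the comparison the finding describes.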

This review was generated by AI and reviewed by human editors.