QUICK REVIEW

[Paper Review] PanGu-$α$: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation

Wei Zeng, Xiaozhe Ren|arXiv (Cornell University)|Apr 26, 2021

Topic Modeling | References: 39 | Cited by: 94

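One-Sentence Summary

PanGu-α is a family of Chinese autoregressive language models with up to 200B parameters, trained on 2048 Ascend 910 processors with five-dimensional auto-parallelism over a 1.1TB high-quality Chinese corpus, and it shows few-shot and zero-shot ability on Chinese NLP tasks.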

ABSTRACT

Large-scale Pretrained Language Models (PLMs) have become the new paradigm for Natural Language Processing (NLP). PLMs with hundreds of billions of parameters such as GPT-3 have demonstrated strong performances on natural language understanding and generation with few-shot in-context learning. In this work, we present our practice on training large-scale autoregressive language models named PanGu-$α$, with up to 200 billion parameters. PanGu-$α$ is developed under the MindSpore and trained on a cluster of 2048 Ascend 910 AI processors. The training parallelism strategy is implemented based on MindSpore Auto-parallel, which composes five parallelism dimensions to scale the training task to 2048 processors efficiently, including data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism and rematerialization. To enhance the generalization ability of PanGu-$α$, we collect 1.1TB high-quality Chinese data from a wide range of domains to pretrain the model. We empirically test the generation ability of PanGu-$α$ in various scenarios including text summarization, question answering, dialogue generation, etc. Moreover, we investigate the effect of model scales on the few-shot performances across a broad range of Chinese NLP tasks. The experimental results demonstrate the superior capabilities of PanGu-$α$ in performing various tasks under few-shot or zero-shot settings.
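
The abstract lists five parallelism dimensions. Only three of them (data, op-level model, and pipeline model parallelism) split the work across devices, so their degrees must multiply to the device count; optimizer parallelism and rematerialization mainly reduce per-device memory. The sketch below simply enumerates degree combinations that fit 2048 devices. The function name and the candidate degrees are hypothetical, and none of the listed layouts is claimed to be the configuration the authors used.

```python
from itertools import product

def parallel_layouts(num_devices, candidates=(1, 2, 4, 8, 16, 32)):
    """Enumerate (data, op_level, pipeline) degrees whose product equals num_devices.

    Hypothetical helper for illustration; `candidates` is an assumed search space.
    """
    layouts = []
    for dp, mp, pp in product(candidates, repeat=3):
        if dp * mp * pp == num_devices:
            layouts.append((dp, mp, pp))
    return layouts

layouts = parallel_layouts(2048)
for dp, mp, pp in layouts[:5]:
    print(f"data={dp} op_level={mp} pipeline={pp}")
```

In practice the choice among such layouts depends on model size, memory per device, and interconnect topology, which is what an auto-parallel framework is meant to search over.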

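Motivation and Goals

Extend large-scale pretrained language models to Chinese, beyond English-centric work.
Develop a Transformer-based autoregressive model with an added query layer for next-token prediction.
Build a 1.1TB high-quality Chinese corpus from multiple sources and preprocess it for pretraining.
Demonstrate scalable distributed training across many devices with MindSpore Auto-parallel.
Evaluate few-shot and zero-shot performance across a range of Chinese NLP tasks.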

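Proposed Method

A unidirectional Transformer decoder with a query layer added on top, used to predict the next token.
PanGu-α models with 2.6B, 13B, and 200B parameters, trained on the 1.1TB Chinese corpus.
Five-dimensional parallelism in MindSpore Auto-parallel (data, op-level model, pipeline model, optimizer model, and rematerialization), combined with topology-aware scheduling.
Model and data partitioned across 2048 Ascend 910 processors, with specific sharding strategies for Q/K/V and the inputs.
Pretraining with a 40k BPE tokenizer and a sequence length of 1024, using cross-entropy on next-token prediction as the objective.
Data quality assessed through a combination of human and model-based evaluation, with perplexity as a proxy for data quality.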
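
The next-token cross-entropy objective and its link to perplexity can be sketched in plain Python. This is a minimal illustration, assuming the logits at position t score the token at position t+1 and that perplexity is the exponential of the mean per-token loss in nats; the function names are hypothetical and this is not the authors' MindSpore implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy where logits at position t predict token t+1."""
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        losses.append(-math.log(probs[token_ids[t + 1]]))
    return sum(losses) / len(losses)

def perplexity(mean_loss):
    """Perplexity as exp(mean per-token cross-entropy)."""
    return math.exp(mean_loss)

# Toy example: a model that is uniform over a 4-token vocabulary.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
tokens = [0, 1, 2, 3]
loss = next_token_loss(uniform_logits, tokens)
print(loss, perplexity(loss))
```

A uniform model over a vocabulary of size V has loss ln V and perplexity V, which gives a quick sanity check on any implementation of this objective.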

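Experimental Results

Research Questions

RQ1: How far does PanGu-α scale Chinese language modeling in terms of parameter count and data size?
RQ2: Can five-dimensional Auto-parallel train a 200B-parameter model efficiently on a large cluster of Ascend 910 processors?
RQ3: How does model scale affect perplexity and few-shot/zero-shot performance on Chinese NLP tasks?
RQ4: Which data filtering and preprocessing strategies yield high-quality data for large-scale Chinese pretraining?
RQ5: How well does PanGu-α perform in generation and few-shot settings on tasks such as summarization, question answering, and dialogue?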

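Key Findings

Validation perplexity falls as model size grows (2.6B: 19.33; 13B: 17.69; 200B: 15.59).
The 200B model's training loss reached about 2.49, which suggests that further training could still bring improvement.
Larger PanGu-α models achieve stronger performance in few-shot and zero-shot settings across Chinese NLP tasks.
The 1.1TB Chinese corpus was built from 80TB of raw data using rule-based cleaning, model-based filtering, and deduplication.
Five-dimensional parallelism, together with topology-aware scheduling, enabled end-to-end training on 2048 Ascend 910 processors.
The authors provide open-source Auto-parallel tooling in MindSpore, which makes similar large-scale pretraining setups easier to reproduce.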
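
The three corpus-construction stages named above (rule-based cleaning, model-based filtering, deduplication) can be sketched as a small pipeline. Everything here is an assumption made for illustration: the cleaning rule, the minimum length, the `quality_score` callback standing in for a trained quality model, and exact-match hashing for deduplication are hypothetical choices, not the rules the authors applied.

```python
import hashlib
import re

def rule_clean(text, min_chars=10):
    """Rule-based cleaning: collapse whitespace, drop very short documents."""
    text = re.sub(r"\s+", " ", text).strip()
    return text if len(text) >= min_chars else None

def deduplicate(docs):
    """Exact deduplication by content hash, keeping the first occurrence."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def build_corpus(raw_docs, quality_score, threshold=0.5):
    """Clean, filter by a model-provided quality score, then deduplicate."""
    cleaned = [c for c in (rule_clean(d) for d in raw_docs) if c is not None]
    filtered = [d for d in cleaned if quality_score(d) >= threshold]
    return deduplicate(filtered)

# Toy usage with a stand-in scorer that rejects documents containing "spam".
raw = [
    "PanGu  is a   large Chinese language model.",
    "PanGu is a large Chinese language model.",
    "short",
    "spam spam spam spam spam",
]
corpus = build_corpus(raw, lambda d: 0.0 if "spam" in d else 1.0)
print(corpus)
```

A real pipeline at this scale would more likely use near-duplicate detection and a learned classifier or perplexity score in place of the stand-in scorer; the sketch only shows how the stages compose.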


This review was generated by AI and checked by a human editor.