QUICK REVIEW

[Paper Review] GLM-130B: An Open Bilingual Pre-trained Model

Aohan Zeng, Xiao Liu|arXiv (Cornell University)|Oct 5, 2022

Advanced Neural Network Applications | Cited by 294

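ONE-SENTENCE SUMMARY

GLM-130B is an open-source, 130B-parameter bilingual (English and Chinese) pre-trained model that aims to surpass GPT-3 on many English benchmarks and ERNIE Titan 3.0 on Chinese ones; INT4 quantization makes inference feasible on affordable GPUs.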

ABSTRACT

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly on loss spikes and divergence. In this paper, we introduce the training process of GLM-130B including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resultant GLM-130B model offers significant outperformance over GPT-3 175B (davinci) on a wide range of popular English benchmarks while the performance advantage is not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B -- the largest Chinese language model -- across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post training, with almost no performance loss, making it the first among 100B-scale models and more importantly, allowing its effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible and its code, training logs, related toolkit, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B/.

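RESEARCH MOTIVATION AND OBJECTIVES

Demonstrate how to train an open, 100B-scale bilingual large language model with an emphasis on transparency and practicality.
Show that GLM-130B surpasses GPT-3 on English benchmarks and is competitive with PaLM 540B.
Evaluate GLM-130B on Chinese benchmarks and compare it with ERNIE Titan 3.0 260B.
Develop training-stability and platform-aware strategies that make inference affordable.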

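PROPOSED METHOD

Adopt the GLM bidirectional autoregressive blank-infilling objective, using [MASK] and [gMASK] tokens.
Use DeepNorm-based Post-LN together with a specific initialization scheme to stabilize training.
Train in mixed precision, with FP16 for the forward and backward passes and FP32 for optimizer states; apply embedding gradient shrink to stabilize the embedding layer.
Pre-train on 1.2T of English data, 1.0T of Chinese WudaoCorpora, and 250G of additional Chinese data, about 2.45T tokens in total.
Mix 5% multi-task instruction pre-training (MIP), drawn from 74 prompted datasets, into pre-training.
Configure 3D parallelism (4-way tensor, 8-way pipeline) and train for 60 days on DGX-A100 nodes, covering 400B tokens.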
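
The DeepNorm-based Post-LN item above can be illustrated with a minimal sketch. It assumes the residual form LayerNorm(alpha * x + sublayer(x)) with alpha = (2N)^(1/2), where N is the number of layers; that constant, the 70-layer depth, and the helper names `layer_norm` and `deepnorm_residual` are assumptions of this example and are not stated in the summary. Learned gain and bias are left out for brevity.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance.

    Learned gain and bias are omitted to keep the sketch short.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def deepnorm_residual(x, sublayer_out, num_layers):
    """Post-LN residual with DeepNorm scaling: LayerNorm(alpha * x + sublayer_out).

    alpha = sqrt(2 * num_layers) is an assumption of this sketch.
    """
    alpha = math.sqrt(2 * num_layers)
    return layer_norm([alpha * a + b for a, b in zip(x, sublayer_out)])

# Hypothetical 4-dimensional hidden state and sublayer (attention or FFN) output.
hidden = [1.0, 2.0, 3.0, 4.0]
sublayer_out = [0.5, -0.5, 0.5, -0.5]
out = deepnorm_residual(hidden, sublayer_out, num_layers=70)  # 70 layers is assumed
print(out)
```

Up-weighting the residual branch before normalization keeps each sublayer's update small relative to the running hidden state, which is the usual intuition for why this kind of scaling helps very deep Post-LN stacks train stably.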

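EXPERIMENTAL RESULTS

Research Questions

RQ1: Can an open 100B-scale bilingual LLM outperform GPT-3 and the BLOOM/OPT family in zero-shot and few-shot learning on English benchmarks?
RQ2: Does GLM-130B's bidirectional GLM architecture improve on decoder-only models for language-understanding tasks?
RQ3: Which training-stability strategies (such as DeepNorm and embedding gradient shrink, EGS) are effective for pre-training a large-scale bilingual LLM, and how do they affect performance and accessibility?
RQ4: Can INT4 weight quantization enable affordable inference on consumer-grade GPUs without significantly degrading performance?
RQ5: How does GLM-130B compare with ERNIE Titan 3.0 260B on Chinese benchmarks (CLUE, FewCLUE)?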

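Key Findings

GLM-130B surpasses GPT-3 175B on a broad set of English benchmarks covering 112 tasks.
With bidirectional attention, it reaches 80.2% zero-shot accuracy on LAMBADA, setting a new record.
GLM-130B outperforms PaLM 540B in many cases and also surpasses ERNIE Titan 3.0 260B on Chinese CLUE tasks.
INT4 weight quantization allows inference on 4× RTX 3090 (24G) or 8× RTX 2080 Ti (11G) with almost no performance loss.
GLM-130B achieves strong few-shot results on MMLU and performs well on zero-shot BIG-bench-lite tasks.
The model exhibits a scaling law for INT4 quantization, maintaining stable performance across benchmarks.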
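
The INT4 hardware claim above can be sanity-checked with back-of-the-envelope arithmetic, and the idea of weight quantization can be shown on a toy row. The sketch below is a generic symmetric absmax scheme applied per row; it is an assumption of this example and may differ from the recipe actually used for GLM-130B. The helper names and the sample weights are hypothetical, and the memory estimate covers weights only, ignoring activations, quantization scales, and any layers kept in higher precision.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

def absmax_quantize_row(row, bits=4):
    """Symmetric absmax quantization of one weight row to signed integers.

    Returns the integer codes and the scale needed to map them back to floats.
    """
    qmax = 2 ** (bits - 1) - 1            # 7 for INT4
    absmax = max(abs(w) for w in row)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return codes, scale

def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

NUM_PARAMS = 130e9
fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)
print(f"FP16 weights: {fp16_gb:.0f} GB, INT4 weights: {int4_gb:.0f} GB")
print("4 x 24G =", 4 * 24, "G;", "8 x 11G =", 8 * 11, "G")

row = [0.7, -0.3, 0.1, 0.0]               # hypothetical weight row
codes, scale = absmax_quantize_row(row)
restored = dequantize_row(codes, scale)
print(codes, restored)
```

Under these assumptions the INT4 weights take about 65 GB, against about 260 GB in FP16, so they fit within the nominal 96G of four 24G cards or the 88G of eight 11G cards, while the FP16 weights do not.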

This review was generated by AI and reviewed by human editors.