QUICK REVIEW

[Paper Review] Baichuan 2: Open Large-scale Language Models

A. Y. Yang, Bin Xiao | arXiv (Cornell University) | Sep 19, 2023

Topic Modeling | Cited by 125

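One-Sentence Summary

Baichuan 2 is a family of open, multilingual large language models with 7B and 13B parameters, trained on 2.6T tokens. The models match or outperform open-source models of similar size and perform strongly in domains such as medicine and law. The release includes published checkpoints and chat variants aligned with human preferences.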

ABSTRACT

Large language models (LLMs) have demonstrated remarkable performance on a variety of natural language tasks based on just a few examples of natural language instructions, reducing the need for extensive feature engineering. However, most powerful LLMs are closed-source or limited in their capability for languages other than English. In this technical report, we present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch, on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval. Furthermore, Baichuan 2 excels in vertical domains such as medicine and law. We will release all pre-training model checkpoints to benefit the research community in better understanding the training dynamics of Baichuan 2.

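Research Motivation and Goals

Address the need for open multilingual LLMs that go beyond English-centric models.
Scale up training data and model size to improve general and domain-specific performance.
Develop architecture and training optimizations that enable efficient large-scale pre-training and alignment.
Release model checkpoints and chat variants to promote safety, reproducibility, and research collaboration.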

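Proposed Method

Introduces Baichuan 2 in two sizes (7B and 13B), trained from scratch on 2.6T tokens of multilingual data.
Modifies the Transformer architecture with SwiGLU activation, LayerNorm/RMSNorm, memory-efficient attention, and an enhanced tokenizer (vocabulary size 125,696).
Uses RoPE positional embeddings for Baichuan 2-7B and ALiBi for Baichuan 2-13B, with attention optimized via xFormers.
Applies NormHead and a max-z loss to stabilize training and make inference more robust.
Uses distributed training with tensor parallelism and ZeRO-based data parallelism, together with memory-sharding techniques and mixed precision (BF16/Float32), to improve efficiency.
Aligns the chat models through supervised fine-tuning (SFT) followed by RLHF with PPO, using a reward model trained on multi-category prompts and 350 iterations of policy optimization.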
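
The NormHead and max-z loss items above can be illustrated with a minimal pure-Python sketch. It assumes that NormHead means L2-normalizing each row of the output (head) weight matrix before computing logits, and that the max-z loss is a small coefficient times the squared maximum logit. The function names and the default coefficient of 2e-4 are illustrative assumptions, not values stated in this review.

```python
import math


def norm_head_logits(hidden, head_weights, eps=1e-12):
    """Compute logits with an L2-normalized output head (NormHead-style sketch).

    hidden:       list of floats, the final hidden state for one token.
    head_weights: list of rows, one row of floats per vocabulary entry.
    Each row is scaled to unit L2 norm, so a logit depends on the direction
    of the row and not on its magnitude.
    """
    logits = []
    for row in head_weights:
        norm = math.sqrt(sum(w * w for w in row))
        scale = 1.0 / max(norm, eps)  # eps guards against an all-zero row
        logits.append(sum(h * w * scale for h, w in zip(hidden, row)))
    return logits


def max_z_loss(logits, coeff=2e-4):
    """Auxiliary penalty on the largest logit: coeff * max(logits) ** 2.

    coeff is an assumed default for illustration only.
    """
    z = max(logits)
    return coeff * z * z


if __name__ == "__main__":
    hidden = [3.0, 4.0]
    head = [[10.0, 0.0], [0.0, 0.5]]  # rows with very different norms
    logits = norm_head_logits(hidden, head)
    print(logits, max_z_loss(logits))
```

In this toy example the two head rows differ in norm by a factor of 20, yet after normalization the logits depend only on how well the hidden state aligns with each row's direction. The auxiliary term would be added to the usual cross-entropy loss during training.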

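Experimental Results

Research Questions

RQ1: How does Baichuan 2 perform on general benchmarks compared with other open-source LLMs of similar size?
RQ2: What effect does large-scale pre-training data have on multilingual and domain-specific capabilities?
RQ3: Do the architecture and training optimizations yield measurable efficiency and stability gains for the 7B and 13B models?
RQ4: How effective is the alignment pipeline (SFT + RLHF) at producing safe and helpful chat models?
RQ5: How does Baichuan 2 perform, relative to other models, in vertical domains such as medicine and law?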

Key Findings

Model	C-Eval	MMLU	CMMLU	Gaokao	AGIEval	BBH	GSM8K	HumanEval
GPT-4	68.40	83.93	70.33	66.15	63.27	75.12	89.99	69.51
GPT-3.5 Turbo	51.10	68.54	54.06	47.07	46.13	61.59	57.77	52.44
LLaMA-7B	27.10	35.10	26.75	27.81	28.17	32.38	9.78	11.59
LLaMA 2-7B	28.90	45.73	31.38	25.97	26.53	39.16	16.22	12.80
MPT-7B	27.15	27.93	26.00	26.54	24.83	35.20	8.64	14.02
Falcon-7B	24.23	26.03	25.66	24.24	24.10	28.77	5.46	-
ChatGLM 2-6B (base)	51.70	47.86	-	-	-	33.68	32.37	-
Baichuan 1-7B	42.80	42.30	44.02	36.34	34.44	32.48	9.17	9.20
Baichuan 2-7B-Base	54.00	54.16	57.07	47.47	42.73	41.56	24.49	18.29
LLaMA-13B	28.50	46.30	31.15	28.23	28.22	37.89	20.55	15.24
LLaMA 2-13B	35.80	55.09	37.99	30.83	32.29	46.98	28.89	15.24
Vicuna-13B	32.80	52.00	36.28	30.11	31.55	43.04	28.13	16.46
Chinese-Alpaca-Plus-13B	38.80	43.90	33.43	34.78	35.46	28.94	11.98	16.46
XVERSE-13B	53.70	55.21	58.44	44.69	42.54	38.06	18.20	15.85
Baichuan 1-13B-Base	52.40	51.60	55.30	49.69	43.20	43.01	26.76	11.59
Baichuan 2-13B-Base	58.10	59.17	61.97	54.33	48.17	48.78	52.77	17.07

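Baichuan 2-7B-Base and Baichuan 2-13B-Base outperform other open-source models of similar size on several benchmarks (e.g., MMLU, CMMLU, GSM8K, HumanEval).
Baichuan 2-7B-Base scores strongly in the law and medicine domains, often surpassing the non-GPT-4 baselines on some Chinese tasks and approaching GPT-4.
Baichuan 2 improves substantially over Baichuan 1 on general and domain benchmarks, with results on GSM8K and HumanEval nearly doubling (for example, the 7B model rises from 9.17 to 24.49 on GSM8K and from 9.20 to 18.29 on HumanEval).
Multilingual evaluation on Flores-101 shows Baichuan 2-7B-Base outperforming its peers on all seven tasks; Baichuan 2-13B-Base outperforms its peers on several tasks, and its Chinese-English ability approaches GPT-4 on some language pairs.
Code and math capabilities improve markedly in Baichuan 2, with the 7B and 13B base models surpassing many contemporaries in each area.
The project provides open model checkpoints from 200B to 2.6T training tokens to expose training dynamics and support further research.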
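
The "nearly doubled" claim about GSM8K and HumanEval can be checked against the results table with a few lines of arithmetic. The sketch below copies the Baichuan 1 and Baichuan 2 base-model scores from the table; the dictionary layout and function name are illustrative choices.

```python
# Scores copied from the results table: (Baichuan 1, Baichuan 2) base models.
SCORES = {
    ("7B", "GSM8K"): (9.17, 24.49),
    ("7B", "HumanEval"): (9.20, 18.29),
    ("13B", "GSM8K"): (26.76, 52.77),
    ("13B", "HumanEval"): (11.59, 17.07),
}


def improvement_ratios(scores):
    """Return the Baichuan 2 / Baichuan 1 score ratio for each entry."""
    return {key: new / old for key, (old, new) in scores.items()}


if __name__ == "__main__":
    for (size, bench), ratio in improvement_ratios(SCORES).items():
        print(f"{size} {bench}: {ratio:.2f}x")
```

The ratios come out at roughly 2.67x and 1.99x for the 7B model (GSM8K, HumanEval) and 1.97x and 1.47x for the 13B model. "Nearly doubled" therefore fits three of the four cases, while the 13B HumanEval gain is smaller.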


This review was generated by AI and reviewed by a human editor.