QUICK REVIEW

[Paper Review] ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Team GLM | arXiv (Cornell University) | Jun 18, 2024

Topic Modeling | Cited by 175

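One-Sentence Summary

ChatGLM presents a series of LLMs that culminates in GLM-4 and GLM-4 All Tools, which achieve strong results on English and Chinese benchmarks and can autonomously use tools to complete complex tasks.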

ABSTRACT

We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models that are trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models are pre-trained on ten trillions of tokens mostly in Chinese and English, along with a small set of corpus from 24 languages, and aligned primarily for Chinese and English usage. The high-quality alignment is achieved via a multi-stage post-training process, which involves supervised fine-tuning and learning from human feedback. Evaluations show that GLM-4 1) closely rivals or outperforms GPT-4 in terms of general metrics such as MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval, 2) gets close to GPT-4-Turbo in instruction following as measured by IFEval, 3) matches GPT-4 Turbo (128K) and Claude 3 for long context tasks, and 4) outperforms GPT-4 in Chinese alignments as measured by AlignBench. The GLM-4 All Tools model is further aligned to understand user intent and autonomously decide when and which tool(s) to use -- including web browser, Python interpreter, text-to-image model, and user-defined functions -- to effectively complete complex tasks. In practical applications, it matches and even surpasses GPT-4 All Tools in tasks like accessing online information via web browsing and solving math problems using Python interpreter. Over the course, we have open-sourced a series of models, including ChatGLM-6B (three generations), GLM-4-9B (128K, 1M), GLM-4V-9B, WebGLM, and CodeGeeX, attracting over 10 million downloads on Hugging Face in the year 2023 alone. The open models can be accessed through https://github.com/THUDM and https://huggingface.co/THUDM.

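Research Motivation and Objectives

Evaluate GLM-4 and GLM-4 All Tools on standard academic benchmarks and long-context tasks.
Describe how pre-training, alignment, and architecture decisions improve Chinese and English capability.
Assess instruction following, alignment, and safety across multiple benchmarks.
Demonstrate the All Tools capability for autonomous tool use (web browser, Python interpreter, text-to-image model) and agent tasks.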

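Proposed Method

Describe the composition of the pre-training data and the tokenization strategy (ten trillion tokens, with a Chinese-English bilingual focus).
Explain the architecture choices (no bias terms except in the QKV projections, RMSNorm, SwiGLU, 2D RoPE, Grouped-Query Attention) and the extension of the context length to 128K/1M.
Outline the multi-stage post-training alignment (SFT, RLHF) and data quality control.
Summarize the All Tools integration, which covers a web browser, a Python interpreter, a text-to-image model, and user-defined functions.
Describe the evaluation setup on the benchmarks (MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, AlignBench, LongBench-Chat, NCB, Berkeley Function Call Leaderboard, AgentBench).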
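
Two of the architecture choices listed above, RMSNorm and a bias-free SwiGLU feed-forward layer, can be illustrated with a minimal pure-Python sketch. This is a generic textbook formulation under assumed toy dimensions, not the GLM-4 implementation; the function names, the `eps` value, and the weight layout are assumptions made for illustration.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a learned gain.

    Unlike LayerNorm, no mean is subtracted and no bias is added.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector; no bias term."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Bias-free SwiGLU feed-forward: W_down @ (SiLU(W_gate @ x) * (W_up @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

# Toy usage: model width 2, hidden width 3 (assumed sizes).
x = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
w_down = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
y = swiglu_ffn(x, w_gate, w_up, w_down)
```

After normalization the vector has a mean square close to 1, which is the property RMSNorm is meant to provide; the SwiGLU layer uses three weight matrices and no bias vectors.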

Figure 1 : The timeline of the GLM family of language, code, vision, and agent models. The focus of this report is primarily on the language models, i.e., ChatGLM. The APIs are publicly available at https://bigmodel.cn and open models can be accessed through https://github.com/THUDM .

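Experimental Results

Research Questions

RQ1: How close are GLM-4 and GLM-4 All Tools to GPT-4 and Claude on standard benchmarks?
RQ2: Can GLM-4's Chinese alignment and long-context capabilities match or exceed those of competing models?
RQ3: How do the architectural innovations and long-context training affect performance and efficiency?
RQ4: How effective is GLM-4 All Tools at autonomous tool use and agent tasks?
RQ5: How does GLM-4's safety and risk profile compare with state-of-the-art models?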

Key Findings

Model	MMLU	GSM8K	MATH	BBH	GPQA	HumanEval
GLM-4-9B-Chat	72.4	79.6	50.6	76.3	28.8	71.8
GLM-4-Air (0605)	81.9	90.9	57.9	80.4	38.4	75.7
GLM-4 (0520)	83.3	93.3	61.3	84.7	39.9	78.5
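
To read the table above more easily, the per-benchmark gap between any two rows can be computed directly. The sketch below only restates the scores from the table; the `gap` helper is a hypothetical name introduced for illustration.

```python
# Scores copied from the table above, in benchmark order.
BENCHMARKS = ["MMLU", "GSM8K", "MATH", "BBH", "GPQA", "HumanEval"]
SCORES = {
    "GLM-4-9B-Chat": [72.4, 79.6, 50.6, 76.3, 28.8, 71.8],
    "GLM-4-Air (0605)": [81.9, 90.9, 57.9, 80.4, 38.4, 75.7],
    "GLM-4 (0520)": [83.3, 93.3, 61.3, 84.7, 39.9, 78.5],
}

def gap(model_a, model_b):
    """Return model_a minus model_b for each benchmark, rounded to one decimal."""
    return {
        name: round(a - b, 1)
        for name, a, b in zip(BENCHMARKS, SCORES[model_a], SCORES[model_b])
    }

for name, delta in gap("GLM-4 (0520)", "GLM-4-9B-Chat").items():
    print(f"{name}: +{delta}")
```

For these three rows, the largest gap between GLM-4 (0520) and GLM-4-9B-Chat is on GSM8K and the smallest is on HumanEval.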

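GLM-4 (0520) scores 83.3 on MMLU, 93.3 on GSM8K, 61.3 on MATH, 84.7 on BBH, 39.9 on GPQA, and 78.5 on HumanEval, approaching GPT-4 Turbo and Claude 3 Opus on many benchmarks.
On instruction following (IFEval), GLM-4 (0520) comes close to GPT-4 Turbo at both the prompt level and the instruction level, including on prompts translated into Chinese.
GLM-4 outperforms GPT-4 on Chinese alignment as measured by AlignBench, and GLM-4 with a 128K context length matches GPT-4 Turbo and Claude 3 Opus on long-context tasks (LongBench-Chat).
GLM-4 All Tools autonomously selects and uses tools (web browser, Python interpreter, text-to-image model) to complete complex tasks, and it matches or even surpasses GPT-4 All Tools in practical information retrieval and math problem solving.
GLM-4-9B-Chat and GLM-4-Air offer competitive performance at lower latency and cost, along with long-context extensions (128K/1M) and code and problem-solving capabilities.
On safety, GLM-4 achieves competitive scores on most SafetyBench dimensions, close to Claude 3 Opus, and approaches the GPT-4 family in overall safety.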
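
The autonomous tool use described in the findings above can be pictured as a choose-then-dispatch loop. The sketch below is a toy illustration only: in GLM-4 All Tools the model itself decides when and which tool to call, whereas here a hypothetical prefix rule (`choose_tool`) stands in for that decision, and `run_python`, `browse_web`, `TOOLS`, and `handle` are assumed names with stub behavior, not part of any GLM API.

```python
from typing import Callable, Dict, Tuple

def run_python(expression: str) -> str:
    # Stand-in for a Python interpreter tool: evaluates one expression with
    # builtins disabled. This is NOT a real sandbox; illustration only.
    return str(eval(expression, {"__builtins__": {}}, {}))

def browse_web(query: str) -> str:
    # Stub: a real browser tool would fetch and summarize web pages.
    return f"[stub search results for '{query}']"

# Hypothetical tool registry mapping tool names to callables.
TOOLS: Dict[str, Callable[[str], str]] = {
    "python": run_python,
    "web_browser": browse_web,
}

def choose_tool(request: str) -> Tuple[str, str]:
    """Toy stand-in for the model's decision about which tool to call."""
    for prefix, tool in (("calc:", "python"), ("search:", "web_browser")):
        if request.startswith(prefix):
            return tool, request[len(prefix):].strip()
    return "none", request

def handle(request: str) -> str:
    """Pick a tool for the request, run it, and return the tool output."""
    tool, argument = choose_tool(request)
    if tool == "none":
        return f"[direct answer to '{argument}']"
    return TOOLS[tool](argument)

print(handle("calc: 2**10"))
print(handle("search: GLM-4 context length"))
```

Keeping the tools in a registry keyed by name means user-defined functions could be added by registering another callable, without changing the dispatch logic.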

Figure 2 : An Illustrative Example of GLM-4 All Tools.

This review was generated by AI and reviewed by human editors.