QUICK REVIEW

[논문 리뷰] ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Team GLM, :|arXiv (Cornell University)|2024. 06. 18.

Topic Modeling인용 수 175

한 줄 요약

ChatGLM은 GLM-4 및 GLM-4 All Tools로 정점에 이르는 가족 LLM을 제시하며, 영어 및 중국어 벤치마크에서 강력한 성능을 달성하고 복잡한 작업을 위한 자동 도구 사용을 가능하게 합니다.

ABSTRACT

We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models that are trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models are pre-trained on ten trillions of tokens mostly in Chinese and English, along with a small set of corpus from 24 languages, and aligned primarily for Chinese and English usage. The high-quality alignment is achieved via a multi-stage post-training process, which involves supervised fine-tuning and learning from human feedback. Evaluations show that GLM-4 1) closely rivals or outperforms GPT-4 in terms of general metrics such as MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval, 2) gets close to GPT-4-Turbo in instruction following as measured by IFEval, 3) matches GPT-4 Turbo (128K) and Claude 3 for long context tasks, and 4) outperforms GPT-4 in Chinese alignments as measured by AlignBench. The GLM-4 All Tools model is further aligned to understand user intent and autonomously decide when and which tool(s) touse -- including web browser, Python interpreter, text-to-image model, and user-defined functions -- to effectively complete complex tasks. In practical applications, it matches and even surpasses GPT-4 All Tools in tasks like accessing online information via web browsing and solving math problems using Python interpreter. Over the course, we have open-sourced a series of models, including ChatGLM-6B (three generations), GLM-4-9B (128K, 1M), GLM-4V-9B, WebGLM, and CodeGeeX, attracting over 10 million downloads on Hugging face in the year 2023 alone. The open models can be accessed through https://github.com/THUDM and https://huggingface.co/THUDM.

연구 동기 및 목표

GLM-4 및 GLM-4 All Tools의 표준 학술 벤치마크와 긴 맥락 작업에서의 성능 평가.
향상된 중국어 및 영어 능력을 가능하게 하는 사전 학습, 정렬 및 아키텍처 결정 설명.
다양한 벤치마크에서 지시 수행, 정렬 및 안전 측면 평가.
웹, 파이썬, 이미지 생성 등 자율 도구 사용 및 에이전트 작업을 위한 All Tools 기능 시연.

제안 방법

사전 학습 데이터 구성 및 토크나이제이션 전략 설명(10조 토큰, 이중언어 집중).
바이패스가 아닌 QKV, RMSNorm, SwiGLU, RoPE2D, Group Query Attention 등의 아키텍처 선택 설명 및 컨텍스트 길이 128K/1M으로의 확장.
SFT, RLHF를 포함한 다단계 포스트-트레이닝 정렬 및 데이터 품질 관리 outline.
웹 브라우저, Python 해석기, 텍스트-투-이미지 모델, 사용자 정의 함수 등 All Tools 통합 요약.
벤치마크(MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, AlignBench, LongBench-Chat, NCB, Berkeley Function Call Leaderboard, AgentBench) 전반의 평가 설정 설명.

Figure 1 : The timeline of the GLM family of language, code, vision, and agent models. The focus of this report is primarily on the language models, i.e., ChatGLM. The APIs are publicly available at https://bigmodel.cn and open models can be accessed through https://github.com/THUDM .

실험 결과

연구 질문

RQ1GLM-4 및 GLM-4 All Tools가 표준 벤치마크에서 GPT-4 및 Claude에 얼마나 근접했는가?
RQ2GLM-4의 중국어 정렬 및 긴 맥락 능력이 경쟁 모델에 비해 일치하거나 이를 능가할 수 있는가?
RQ3아키텍처 혁신 및 긴 맥락 학습이 성능과 효율성에 어떤 영향을 미치는가?
RQ4GLM-4 All Tools의 자율 도구 사용 및 에이전트 작업의 효과성은 어떠한가?
RQ5안전성 및 위험 프로파일이 최첨단 모델들과 비교하여 어떤가?

주요 결과

모델	MMLU	GSM8K	MATH	BBH	GPQA	HumanEval
GLM-4-9B-Chat	72.4	79.6	50.6	76.3	28.8	71.8
GLM-4-Air (0605)	81.9	90.9	57.9	80.4	38.4	75.7
GLM-4 (0520)	83.3	93.3	61.3	84.7	39.9	78.5

GLM-4 (0520) 은 MMLU 83.3, GSM8K 93.3, MATH 61.3, BBH 84.7, GPQA 39.9, HumanEval 78.5로, 많은 벤치마크에서 GPT-4 Turbo 및 Claude 3 Opus에 근접합니다.
지시 수행에서, GLM-4-0520은 프롬프트/지시 설정에서 GPT-4 Turbo(2024-04-09)와 중국어 번역 프롬프트에서 GPT-4 Turbo와 높은 유사성을 보였습니다.
GLM-4는 AlignBench에서 중국어 정렬에서 GPT-4를 능가하거나 일치하고, GLM-4 128K 컨텍스트 길이는 LongBench-Chat에서 GPT-4 Turbo 및 Claude 3 Opus에 맞먹습니다.
GLM-4 All Tools는 웹 브라우저, Python 해석기, 텍스트-투-이미지 모델 등 도구를 자율적으로 선택·사용하여 복잡한 작업을 완료할 수 있으며, 실무 정보 접근 및 수학 풀이에서 종종 GPT-4 All Tools를 능가합니다.
GLM-4-9B-Chat 및 GLM-4-Air는 대기 시간 및 비용이 감소된 경쟁력 있는 성능을 제공하며, 긴 맥락 확장(128K/1M) 및 코드/문제 해결 능력을 갖추고 있습니다.
안전성 측면에서 GLM-4는 Safe tyBench의 대부분 차원에서 경쟁력 있는 점수를 보이며 Claude 3 Opus에 근접하고 overall safety 측면에서 GPT-4 계열에 근접합니다.

Figure 2 : An Illustrative Example of GLM-4 All Tools.

더 나은 연구,지금 바로 시작하세요

연구 설계부터 논문 작성까지, 연구 시간을 획기적으로 줄여보세요.

카드 등록 없음 · 무료 플랜 제공

이 리뷰는 AI가 만들고, 인간 에디터가 검토했습니다.