QUICK REVIEW

[Paper Review] Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

Xinrun Du, Zhouliang Yu | arXiv (Cornell University) | Apr 5, 2024

Natural Language Processing Techniques | Cited by 5

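One-Sentence Summary

CT-LLM is a 2B-parameter LLM pretrained from scratch primarily on Chinese data (800B Chinese tokens) that delivers strong Chinese ability and competitive multilingual performance, and is released with open-source data, the CHC-Bench benchmark, and SFT/DPO alignment.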

ABSTRACT

In this study, we introduce CT-LLM, a 2B large language model (LLM) that illustrates a pivotal shift towards prioritizing the Chinese language in developing LLMs. Uniquely initiated from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data, utilizing an extensive corpus of 1,200 billion tokens, including 800 billion Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This strategic composition facilitates the model's exceptional proficiency in understanding and processing Chinese, a capability further enhanced through alignment techniques. Demonstrating remarkable performance on the CHC-Bench, CT-LLM excels in Chinese language tasks, and showcases its adeptness in English through SFT. This research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other languages, broadening the horizons for LLM training methodologies. By open-sourcing the full process of training a Chinese LLM, including a detailed data processing procedure with the obtained Massive Appropriate Pretraining Chinese Corpus (MAP-CC), a well-chosen multidisciplinary Chinese Hard Case Benchmark (CHC-Bench), and the 2B-size Chinese Tiny LLM (CT-LLM), we aim to foster further exploration and innovation in both academia and industry, paving the way for more inclusive and versatile language models.

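Research Motivation and Objectives

Demonstrate that a Chinese-centric LLM can outperform English-centric baselines on Chinese tasks.
Provide a high-quality Chinese pretraining corpus (MAP-CC) and release the data processing pipeline.
Show the model's multilingual adaptability and English ability through supervised fine-tuning.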

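Proposed Method

Pretrain CT-LLM on a mixed dataset of 1,254.68B tokens, comprising roughly 800B Chinese tokens, 300B English tokens, and 100B code tokens (the rounded figures given in the abstract, which total 1,200B).
Use a transformer decoder architecture with 32 layers, a hidden dimension of 2,048, 16 attention heads, and a context length of 4,096.
Apply rotary position embeddings (RoPE), SwiGLU activations, RMSNorm, and shared input-output embeddings for efficiency.
Adopt the baichuan2 tokenizer, which uses byte-pair encoding (BPE), with a vocabulary size of 125,696 and digit-by-digit tokenization of numbers.
Perform supervised fine-tuning (SFT) on Chinese and English data, filtered by perplexity with Qwen-7B as the scoring model.
Run preference optimization with DPO on a mixed Chinese/English preference dataset to align the model with human preferences.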
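
The two alignment steps above, perplexity-based filtering of SFT data and DPO, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the per-token log-probabilities have already been computed by a scoring model (Qwen-7B in the paper) and by the policy and reference models. The function names, the `max_ppl` threshold, and `beta=0.1` are hypothetical choices made for this example; this summary does not state the values used in the paper.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean per-token log-probability (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def filter_by_perplexity(samples, max_ppl):
    """Keep the texts whose perplexity is at most max_ppl.

    `samples` is a list of (text, token_logprobs) pairs. The log-probs are
    assumed to come from a scoring model run beforehand; `max_ppl` is an
    illustrative threshold, not a value from the paper.
    """
    return [text for text, logprobs in samples if perplexity(logprobs) <= max_ppl]


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response under the
    policy or the frozen reference model. beta=0.1 is an assumed default.
    """
    margin = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    x = beta * margin
    # -log(sigmoid(x)) written as a numerically stable softplus(-x).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    samples = [
        ("fluent sample", [math.log(0.5)] * 3),      # perplexity 2
        ("unlikely sample", [math.log(0.01)] * 3),   # perplexity 100
    ]
    print(filter_by_perplexity(samples, max_ppl=10.0))
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

When the policy and reference models agree exactly, the margin is zero and the loss equals ln 2. The loss falls below that value as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference.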

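Experimental Results

Research Questions

RQ1: Can a Chinese-centric pretraining regime achieve strong Chinese language understanding and generation without English-dominated data?
RQ2: How do SFT and DPO affect CT-LLM's Chinese and multilingual capabilities?
RQ3: How does MAP-CC data preprocessing affect model quality?
RQ4: How does CT-LLM compare with other 2B models on Chinese instruction understanding and execution in CHC-Bench?
RQ5: What are the safety and alignment characteristics of CT-LLM-SFT-DPO relative to baselines?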

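Key Findings

CT-LLM gains markedly in Chinese language ability because its data mixture emphasizes Chinese content.
CT-LLM performs evenly across multidisciplinary tasks, and its English-Chinese gap on multi-domain tasks is smaller than that of some English-centric models.
SFT-DPO alignment improves safety and preference-driven responses relative to baselines.
CT-LLM shows competitive or better performance on Chinese instruction following in CHC-Bench.
CT-LLM-SFT-DPO maintains strong performance on English benchmarks despite Chinese-centric pretraining.
Overall, the experiments show that this 2B model has better Chinese ability and competitive multilingual adaptability.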

This review was generated by AI and reviewed by human editors.