QUICK REVIEW

[Paper Review] Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca

Yiming Cui, Ziqing Yang|arXiv (Cornell University)|Apr 17, 2023

Natural Language Processing Techniques|Cited by 71

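One-Sentence Summary

This work extends LLaMA's vocabulary with 20,000 Chinese tokens, uses LoRA for efficient training and fine-tuning, and shows that the resulting Chinese LLaMA/Alpaca models understand Chinese better and follow instructions more reliably.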

ABSTRACT

Large Language Models (LLMs), such as ChatGPT and GPT-4, have dramatically transformed natural language processing research and shown promising strides towards Artificial General Intelligence (AGI). Nonetheless, the high costs associated with training and deploying LLMs present substantial obstacles to transparent, accessible academic research. While several large language models, such as LLaMA, have been open-sourced by the community, these predominantly focus on English corpora, limiting their usefulness for other languages. In this paper, we propose a method to augment LLaMA with capabilities for understanding and generating Chinese text and its ability to follow instructions. We achieve this by extending LLaMA's existing vocabulary with an additional 20,000 Chinese tokens, thereby improving its encoding efficiency and semantic understanding of Chinese. We further incorporate secondary pre-training using Chinese data and fine-tune the model with Chinese instruction datasets, significantly enhancing the model's ability to comprehend and execute instructions. Our experimental results indicate that the newly proposed model markedly enhances the original LLaMA's proficiency in understanding and generating Chinese content. Additionally, the results on the C-Eval dataset yield competitive performance among the models with several times the size of ours. We have made our pre-trained models, training scripts, and other resources available through GitHub, fostering open research for our community. Chinese LLaMA series: https://github.com/ymcui/Chinese-LLaMA-Alpaca and Chinese Llama-2 series: https://github.com/ymcui/Chinese-LLaMA-Alpaca-2

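Research Motivation and Goals

Improve the Chinese understanding and generation abilities of LLaMA and Alpaca.
Improve the encoding efficiency and semantic understanding of Chinese text.
Make training and adaptation cost-effective through parameter-efficient fine-tuning (LoRA).
Provide pre-trained Chinese LLaMA/Alpaca resources to the community.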

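Proposed Method

Extend the LLaMA vocabulary with 20,000 Chinese tokens and merge them with the original tokenizer to create the Chinese LLaMA tokenizer (vocabulary size 49,953).
Resize the embedding matrix to accommodate the extended vocabulary without changing the embedding vectors of the original tokens.
Apply low-rank adaptation (LoRA) adapters to the attention and MLP layers for parameter-efficient pre-training and fine-tuning.
Pre-train Chinese LLaMA on Chinese corpora with the standard Causal Language Modeling (CLM) objective (20 GB for the basic version, 120 GB for the Plus version).
Run supervised fine-tuning (SFT) on Chinese instruction data with templated prompts, following the Alpaca recipe; the Chinese Alpaca vocabulary has 49,954 entries, one more than the Chinese LLaMA tokenizer.
Evaluate on instruction-following tasks and a natural language understanding benchmark (C-Eval), using GPT-4-based scoring and manual checks.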
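
The embedding-resize and LoRA steps listed above can be illustrated with a small, dependency-free sketch. It is not the authors' training code: the function names `resize_embeddings` and `lora_linear`, the toy matrix sizes, and the random initialisation of the appended rows are all assumptions made for illustration. The only parts taken from the review are that original token embeddings stay unchanged when the vocabulary grows, and that LoRA adds a low-rank update on top of a frozen weight.

```python
import random


def resize_embeddings(embeddings, new_vocab_size, seed=0):
    """Return a copy of `embeddings` grown to `new_vocab_size` rows.

    Existing rows are copied unchanged. Appended rows get small random
    values, which is an assumed initialisation, not one taken from the paper.
    """
    old_vocab_size = len(embeddings)
    if new_vocab_size < old_vocab_size:
        raise ValueError("new_vocab_size must not shrink the vocabulary")
    hidden_size = len(embeddings[0])
    rng = random.Random(seed)
    resized = [row[:] for row in embeddings]  # original rows preserved
    for _ in range(new_vocab_size - old_vocab_size):
        resized.append([rng.gauss(0.0, 0.02) for _ in range(hidden_size)])
    return resized


def lora_linear(x, weight, lora_a, lora_b, alpha):
    """Compute y = W x + (alpha / r) * B (A x) for one input vector x.

    Shapes: weight is out x in (frozen), lora_a is r x in, lora_b is out x r.
    """
    r = len(lora_a)
    base = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
    down = [sum(a * xi for a, xi in zip(row, x)) for row in lora_a]
    up = [sum(b * d for b, d in zip(row, down)) for row in lora_b]
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, up)]


# Toy usage: a 2-token vocabulary grown to 5 tokens, hidden size 2.
old_table = [[1.0, 2.0], [3.0, 4.0]]
new_table = resize_embeddings(old_table, 5)
print(len(new_table), new_table[:2] == old_table)
```

With `lora_b` set to all zeros, `lora_linear` returns exactly the frozen layer's output, so the adapter starts as a no-op and only the small `lora_a` and `lora_b` matrices need to be trained.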

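Experimental Results

Research Questions

RQ1: Does extending LLaMA's vocabulary with 20k Chinese tokens improve Chinese encoding efficiency and generation quality?
RQ2: Can LoRA enable efficient training and fine-tuning of Chinese LLaMA/Alpaca models under limited compute?
RQ3: How do Chinese LLaMA and Chinese Alpaca compare with the baseline LLaMA/Alpaca on instruction-following and NLU benchmarks?
RQ4: How does data scale (20 GB vs. 120 GB) affect the performance of the Chinese models?
RQ5: How do decoding strategies and evaluation methods affect model assessment on Chinese tasks?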

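Key Findings

The Chinese LLaMA tokenizer markedly shortens encoded sequences: it produces roughly half as many tokens as the original tokenizer, which effectively doubles the amount of Chinese text that fits in the context window and speeds up generation.
LoRA-based training delivers parameter-efficient pre-training and fine-tuning for all Chinese LLaMA/Alpaca variants, with the trainable adapters placed on the attention and MLP components.
The Chinese Alpaca Plus variants outperform the basic models at instruction following across multiple tasks, and the larger Plus models generally receive higher GPT-4-assigned scores on several tasks.
Alpaca-33B is stronger than Plus-7B/Plus-13B at numerical reasoning, coding, and ethics, but may trail the Plus models at text generation and multi-turn dialogue, which the review attributes to the interaction between data volume and model size.
On the C-Eval natural language understanding benchmark, the paper reports performance competitive with considerably larger models, indicating that the proposed vocabulary extension and training recipe improve Chinese ability efficiently.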
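
The first finding is simple arithmetic: if the same text needs about half as many tokens, a fixed token window holds about twice as much text, and generating a fixed amount of text takes about half as many decoding steps. The sketch below works through that with made-up numbers. The ratios of 2.0 and 1.0 tokens per character, the 2,048-token window, and the helper names are illustrative assumptions, not measurements from the paper.

```python
import math

# Hypothetical values chosen only to show the "about half" relationship.
CONTEXT_TOKENS = 2048
ORIGINAL_TOKENS_PER_CHAR = 2.0   # assumed cost with the original tokenizer
EXTENDED_TOKENS_PER_CHAR = 1.0   # assumed cost with the extended tokenizer


def chars_that_fit(context_tokens, tokens_per_char):
    """Approximate number of characters that fit in the token window."""
    if tokens_per_char <= 0:
        raise ValueError("tokens_per_char must be positive")
    return int(context_tokens / tokens_per_char)


def decode_steps(num_chars, tokens_per_char):
    """Approximate number of generation steps to emit `num_chars` characters."""
    return math.ceil(num_chars * tokens_per_char)


before = chars_that_fit(CONTEXT_TOKENS, ORIGINAL_TOKENS_PER_CHAR)
after = chars_that_fit(CONTEXT_TOKENS, EXTENDED_TOKENS_PER_CHAR)
print(before, after, after / before)  # 1024 2048 2.0

print(decode_steps(500, ORIGINAL_TOKENS_PER_CHAR),
      decode_steps(500, EXTENDED_TOKENS_PER_CHAR))  # 1000 500
```

Real ratios depend on the text being encoded, so the doubling should be read as an approximation.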

This review was generated by AI and checked by a human editor.