QUICK REVIEW

[Paper Digest] LLaMA Beyond English: An Empirical Study on Language Capability Transfer

Jun Zhao, Zhihao Zhang|arXiv (Cornell University)|Jan 2, 2024

Topic Modeling | Cited by 6

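One-Sentence Summary

The paper studies how to transfer LLaMA's language generation and instruction-following capabilities to non-English languages. It finds that vocabulary extension is generally unnecessary and that performance comparable to state-of-the-art transfer models can be reached with less than 1% of the pretraining data.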

ABSTRACT

In recent times, substantial advancements have been witnessed in large language models (LLMs), exemplified by ChatGPT, showcasing remarkable proficiency across a range of complex tasks. However, many mainstream LLMs (e.g. LLaMA) are pretrained on an English-dominant corpus, which limits their performance in other non-English languages. In this paper, we focus on how to effectively transfer the capabilities of language generation and following instructions to a non-English language. To answer this question, we conduct an extensive empirical investigation based on LLaMA, accumulating over 1440 GPU hours. We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer. To accurately assess the model's level of knowledge, we employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench. Furthermore, a comprehensive evaluation of the model's response quality is conducted, considering aspects such as accuracy, fluency, informativeness, logical coherence, and harmlessness, based on LLM-Eval, a benchmark consisting of instruction tasks from 17 diverse categories. Our evaluation results demonstrate that comparable performance to state-of-the-art transfer models can be achieved with less than 1% of the pretraining data, both in terms of knowledge alignment and response quality. Furthermore, the experimental outcomes across the thirteen low-resource languages also exhibit similar trends. We anticipate that the conclusions revealed by the experiments will aid the community in developing non-English LLMs.

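Research Motivation and Objectives

Assess whether vocabulary extension, further pretraining, and instruction tuning are needed to transfer LLaMA to non-English languages.
Quantify how much pretraining and instruction data is required to transfer capabilities to a non-English language.
Evaluate knowledge level and response quality on several non-English benchmarks.
Study cross-lingual alignment and code-switching during transfer.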

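Proposed Method

Use LLaMA, LLaMA2, and Chinese-adapted variants as baselines, covering different pretraining scales.
Compare models with and without vocabulary extension to assess its effect on transfer.
Further pretrain on Chinese at scales of up to 100B tokens.
Instruction-tune with the BELLE (Chinese) and Bactrian-X (52 languages) datasets.
Evaluate knowledge transfer with C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench, and response quality with LLM-Eval across 17 categories.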
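To make the vocabulary-extension variable concrete, here is a minimal sketch in Python. It is a toy illustration, not the paper's tokenizer or training code: `tokenize`, `extend_vocab`, and the tiny `base_vocab` are hypothetical names and data introduced only for this example. It assumes a tokenizer that falls back to UTF-8 bytes for characters outside its vocabulary, so a Chinese character costs several tokens without extension and one token with it, while every added token needs a new, untrained embedding row.

```python
def tokenize(text, vocab):
    """Toy tokenizer: one token per character found in `vocab`,
    otherwise one token per UTF-8 byte (byte fallback)."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


def extend_vocab(vocab, new_tokens):
    """Return the extended vocabulary and the list of tokens that were
    actually new. Each new token would need a fresh embedding row that
    has to be learned during further pretraining."""
    added = [t for t in new_tokens if t not in vocab]
    return vocab | set(added), added


# Hypothetical English-only vocabulary (illustrative, not LLaMA's).
base_vocab = set("abcdefghijklmnopqrstuvwxyz ")
extended_vocab, added = extend_vocab(base_vocab, ["你", "好", "a"])

text = "你好"
base_tokens = tokenize(text, base_vocab)      # byte fallback: 3 bytes per character
ext_tokens = tokenize(text, extended_vocab)   # one token per character
print(len(base_tokens), len(ext_tokens), len(added))
```

The trade-off the sketch exposes is shorter sequences versus new parameters that start untrained, which is the tension the vocabulary-extension comparison is designed to measure.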

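Experimental Results

Research Questions

RQ1: With further pretraining on the order of tens of billions of tokens, does vocabulary extension help or hinder non-English transfer?
RQ2: How much further pretraining and instruction data is needed to improve knowledge alignment and response quality in the target language?
RQ3: How does non-English transfer affect the model's original English capability, and can multilingual joint training mitigate any degradation?
RQ4: Do phenomena such as code-switching during transfer provide evidence that cross-lingual alignment was learned during pretraining?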

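Key Findings

Vocabulary extension is not a favorable choice at training scales of tens of billions of tokens: a model further pretrained on 0.5B Chinese tokens with the original vocabulary outperforms vocabulary-extended models trained on more than 30B tokens.
Further pretraining on up to 100B tokens improves response quality when little instruction-tuning data is available, but pretraining at the 100B-token scale may still be insufficient to significantly raise the model's knowledge level.
Gains in response quality come mainly from instruction tuning, which needs only hundreds of thousands of instruction examples, rather than from large-scale pretraining.
Transfer training on Chinese alone degrades English capability unless multilingual joint training is used to mitigate the loss.
On the knowledge benchmarks (C-Eval, GAOKAO-Bench, MMLU, AGI-Eval) and on LLM-Eval, the approach matches state-of-the-art transfer models in knowledge and response quality with less than 1% of the pretraining data; results on 13 low-resource languages show similar trends.
Code-switching during transfer (in roughly 2% to 5% of samples) suggests that cross-lingual semantic alignment was learned during pretraining.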
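As a rough illustration of how a code-switching rate like the one above could be estimated, the following Python sketch flags a response as code-switched when it mixes CJK ideographs with ASCII Latin letters. This digest does not describe the paper's actual detection criterion, so the heuristic, the function names, and the sample responses are all assumptions made for this example.

```python
def scripts_in(text):
    """Return the set of scripts ('cjk', 'latin') that appear in `text`."""
    found = set()
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":          # CJK Unified Ideographs block
            found.add("cjk")
        elif ch.isascii() and ch.isalpha():     # ASCII Latin letters
            found.add("latin")
    return found


def is_code_switched(response):
    """Assumed heuristic: a response is code-switched if it mixes both scripts."""
    return scripts_in(response) == {"cjk", "latin"}


def code_switching_rate(responses):
    """Fraction of responses flagged as code-switched (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(is_code_switched(r) for r in responses) / len(responses)


# Hypothetical model outputs, made up for illustration only.
samples = [
    "今天天气很好。",
    "这个 model 的表现不错。",
    "The weather is nice today.",
    "请给我一个例子。",
]
rate = code_switching_rate(samples)
print(rate)  # 0.25: only the second sample mixes scripts
```

A script-mixing check like this over-counts responses that legitimately contain Latin-script proper nouns (for example a model name inside a Chinese sentence), so a real analysis would need a stricter criterion.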

This digest was generated by AI and reviewed by human editors.