QUICK REVIEW

[Paper Review] LLaMA Beyond English: An Empirical Study on Language Capability Transfer

Jun Zhao, Zhihao Zhang|arXiv (Cornell University)|2024. 01. 02.
Topic Modeling | Citations: 6
One-Line Summary

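This paper investigates how to transfer LLaMA's language generation and instruction-following capabilities to non-English languages. It finds that vocabulary extension is often unnecessary and that transfer comparable to the state of the art can be achieved with less than 1% of the further pretraining data.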

ABSTRACT

In recent times, substantial advancements have been witnessed in large language models (LLMs), exemplified by ChatGPT, showcasing remarkable proficiency across a range of complex tasks. However, many mainstream LLMs (e.g. LLaMA) are pretrained on English-dominant corpora, which limits their performance in other non-English languages. In this paper, we focus on how to effectively transfer the capabilities of language generation and following instructions to a non-English language. To answer this question, we conduct an extensive empirical investigation based on LLaMA, accumulating over 1440 GPU hours. We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer. To accurately assess the model's level of knowledge, we employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench. Furthermore, a comprehensive evaluation of the model's response quality is conducted, considering aspects such as accuracy, fluency, informativeness, logical coherence, and harmlessness, based on LLM-Eval, a benchmark consisting of instruction tasks from 17 diverse categories. Our evaluation results demonstrate that comparable performance to state-of-the-art transfer models can be achieved with less than 1% of the pretraining data, both in terms of knowledge alignment and response quality. Furthermore, the experimental outcomes across the thirteen low-resource languages also exhibit similar trends. We anticipate that the conclusions revealed by the experiments will aid the community in developing non-English LLMs.

Research Motivation and Objectives

  • Assess whether vocabulary extension, further pretraining, and instruction tuning are needed for non-English language transfer from LLaMA.
  • Quantify how much pretraining and instruction data are required to transfer capabilities to non-English languages.
  • Evaluate knowledge level and response quality across multiple benchmarks in non-English languages.
  • Investigate cross-lingual alignment and code-switching phenomena during transfer.

Proposed Method

  • Use LLaMA, LLaMA2, and Chinese-adapted variants as baselines with varying pretraining scales.
  • Extend vocabulary or not to assess its impact on transfer.
  • Perform further pretraining in Chinese with scales up to 100B tokens.
  • Apply instruction tuning using the BELLE (Chinese) and Bactrian-X (52-language) datasets.
  • Evaluate knowledge transfer using C-Eval, MMLU, AGI-Eval, GAOKAO-Bench and response quality using LLM-Eval across 17 categories.
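
To make the vocabulary-extension variable concrete, here is a minimal toy sketch in Python. It is not LLaMA's actual tokenizer and not the paper's code: the `tokenize` function, both vocabularies, and the sample text are hypothetical. It assumes only the general idea that a character missing from the vocabulary falls back to one token per UTF-8 byte, so extending the vocabulary shortens the encoded sequence.

```python
def tokenize(text: str, vocab: set) -> list:
    """Toy greedy longest-match tokenizer with UTF-8 byte fallback.

    Hypothetical illustration only; real subword tokenizers are more involved.
    """
    max_len = max((len(t) for t in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that starts at position i.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No entry matched: emit one token per UTF-8 byte of this character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical vocabularies: an English-only base and a Chinese-extended one.
base_vocab = {"hello", " ", "world"}
extended_vocab = base_vocab | {"你好", "世界"}

text = "hello 你好世界"
base_tokens = tokenize(text, base_vocab)
extended_tokens = tokenize(text, extended_vocab)

print(len(base_tokens), len(extended_tokens))
```

The sketch shows only the encoding-efficiency side of vocabulary extension. Whether shorter sequences translate into better transfer is the empirical question the study examines, and its reported answer at these training scales is that they do not.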

Experimental Results

Research Questions

  • RQ1: Does vocabulary extension help or hinder non-English transfer at tens of billions of pretraining tokens?
  • RQ2: What scale of further pretraining and instruction data is required to improve knowledge alignment and response quality in the target language?
  • RQ3: How does non-English transfer affect the model’s original English capabilities, and can multilingual joint training mitigate any degradation?
  • RQ4: Is cross-lingual alignment learned during pretraining evidenced by phenomena such as code-switching during transfer?

Key Findings

  • Vocabulary extension is not a favorable choice for transfer at training scales of tens of billions of tokens: a model further pretrained on 0.5B Chinese tokens with the original vocabulary outperforms extended-vocabulary models pretrained on more than 30B tokens.
  • Further pretraining on up to 100B tokens improves response quality when instruction-tuning data is scarce, but even that scale may be insufficient to significantly raise the model's knowledge level.
  • Gains in response quality from instruction tuning require only hundreds of thousands of instruction examples, not large-scale further pretraining.
  • Transfer training exclusively on Chinese degrades the model's original English capabilities; multilingual joint training mitigates this loss.
  • On benchmarks (C-Eval, GAOKAO-Bench, MMLU, AGI-Eval) and LLM-Eval, the approach achieves comparable knowledge and response quality to state-of-the-art non-English LLMs using <1% of the training data; results extend to 13 low-resource languages.
  • Code-switching behavior observed during transfer (approximately 2%–5% of samples) suggests cross-lingual semantic alignment learned during pretraining.
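
As a rough illustration of how a code-switching rate like the one above could be measured, the following Python sketch flags responses that mix Chinese characters with Latin-alphabet words. This is an assumed heuristic, not the procedure used in the paper; the function names, regular expressions, and sample responses are all hypothetical.

```python
import re

# Assumed heuristic: a response counts as code-switched if it contains both
# CJK ideographs and a run of at least two Latin letters.
CJK = re.compile(r"[\u4e00-\u9fff]")
LATIN_WORD = re.compile(r"[A-Za-z]{2,}")


def is_code_switched(response: str) -> bool:
    """Return True if the response mixes Chinese characters and Latin words."""
    return bool(CJK.search(response)) and bool(LATIN_WORD.search(response))


def code_switching_rate(responses) -> float:
    """Fraction of responses flagged as code-switched (0.0 for empty input)."""
    if not responses:
        return 0.0
    return sum(is_code_switched(r) for r in responses) / len(responses)


# Made-up sample responses, for illustration only.
samples = [
    "北京是中国的首都。",
    "这个模型的 performance 很好。",
    "Paris is the capital of France.",
    "机器学习是人工智能的一个分支。",
]
rate = code_switching_rate(samples)
print(f"{rate:.0%} of samples flagged")
```

A script-mixing check like this over-counts legitimate borrowings such as acronyms and product names, so any real measurement would likely need manual review or a stricter definition of code-switching.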


This review was generated by AI and reviewed by a human editor.