QUICK REVIEW

[Paper Review] Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities

Kazuki Fujii, Taishi Nakamura|arXiv (Cornell University)|2024. 04. 27.

Natural Language Processing Techniques | Citations: 5

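One-Line Summary

Swallow, a Llama 2-based LLM with enhanced Japanese capability, was built by continual pre-training on Japanese data (with vocabulary expansion). Its performance on Japanese tasks increases monotonically with training data up to 100B tokens, and it outperforms models trained from scratch in English and Japanese.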

ABSTRACT

Cross-lingual continual pre-training of large language models (LLMs) initially trained on English corpus allows us to leverage the vast amount of English language resources and reduce the pre-training cost. In this study, we constructed Swallow, an LLM with enhanced Japanese capability, by extending the vocabulary of Llama 2 to include Japanese characters and conducting continual pre-training on a large Japanese web corpus. Experimental results confirmed that the performance on Japanese tasks drastically improved through continual pre-training, and the performance monotonically increased with the amount of training data up to 100B tokens. Consequently, Swallow achieved superior performance compared to other LLMs that were trained from scratch in English and Japanese. An analysis of the effects of continual pre-training revealed that it was particularly effective for Japanese question answering tasks. Furthermore, to elucidate effective methodologies for cross-lingual continual pre-training from English to Japanese, we investigated the impact of vocabulary expansion and the effectiveness of incorporating parallel corpora. The results showed that the efficiency gained through vocabulary expansion had no negative impact on performance, except for the summarization task, and that the combined use of parallel corpora enhanced translation ability.

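Motivation and Objectives

Enable efficient cross-lingual adaptation of English-trained LLMs to Japanese through continual pre-training.
Quantify how the amount of Japanese data and the model size affect performance on Japanese and English tasks.
Investigate vocabulary expansion and parallel corpora as techniques for improving Japanese generation and translation.
Evaluate the benefit of continual pre-training relative to models trained from scratch on Japanese.
Provide practical guidance for cross-lingual continual pre-training in the Japanese setting.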

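Proposed Method

Extend the Llama 2 vocabulary with Japanese subwords and characters (vocabulary expansion, VE).
Perform continual pre-training on a mixture of roughly 100B tokens, about 90% Japanese and 10% English, using a replay strategy that keeps English data in the mix.
Evaluate with llm-jp-eval and the LM Evaluation Harness on Japanese and English tasks in six categories: question answering (QA), reading comprehension (RC), automatic summarization (AS), arithmetic reasoning (AR), commonsense reasoning (CR), and machine translation (MT).
Compare Swallow (7B/13B/70B) with the base Llama 2 variants and with models trained from scratch on Japanese.
Analyze how VE and parallel corpora affect task performance and translation ability.
Train with Flash Attention 2, a cosine learning-rate schedule with warmup, and the AdamW optimizer.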
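
The data mixture and learning-rate schedule named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' training code: the function names, the step counts, and the peak and minimum learning rates are hypothetical placeholders. Only the roughly 90/10 Japanese/English split and the roughly 100B-token budget come from this review.

```python
import math


def mixture_token_budget(total_tokens, ja_fraction=0.9):
    """Split a token budget between Japanese data and replayed English data."""
    if not 0.0 <= ja_fraction <= 1.0:
        raise ValueError("ja_fraction must be in [0, 1]")
    ja_tokens = round(total_tokens * ja_fraction)
    return {"ja": ja_tokens, "en": total_tokens - ja_tokens}


def cosine_lr_with_warmup(step, total_steps, warmup_steps, peak_lr, min_lr):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(mixture_token_budget(100_000_000_000))
    # Hypothetical hyperparameters, chosen only to show the curve's shape.
    for step in (0, 500, 1_000, 13_000, 25_000):
        lr = cosine_lr_with_warmup(step, total_steps=25_000, warmup_steps=1_000,
                                   peak_lr=1e-4, min_lr=1e-6)
        print(step, lr)
```

In practice a schedule like this would be passed to the optimizer through the training framework's scheduler hook; the pure function here only shows the shape of the curve.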

Figure 1: Relative change in performance of Swallow compared to $\mathtt{Llama\ 2}$. Japanese tasks (left, see Table 2 for task details) improved by up to approximately 70%.
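
To make the "relative change" in the caption above concrete, the sketch below uses the conventional definition, (adapted - baseline) / baseline. This review does not state the paper's exact formula, so the definition is an assumption, and the scores in the example are hypothetical rather than taken from the paper.

```python
def relative_change_percent(baseline, adapted):
    """Relative change of `adapted` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (adapted - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Hypothetical scores, for illustration only (not from the paper).
    change = relative_change_percent(baseline=0.40, adapted=0.60)
    print(f"{change:+.1f}%")
```

Note that a relative change differs from an absolute gain in points: the same absolute gain yields a larger percentage when the baseline score is low.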

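Experimental Results

Research Questions

RQ1: Does continual pre-training improve Japanese task performance in English-to-Japanese transfer regardless of model size?
RQ2: How does the amount of Japanese data used in continual pre-training affect performance, and is the relationship monotonic?
RQ3: How does vocabulary expansion affect performance and efficiency?
RQ4: Introducing Japanese-English parallel corpora improves translation, but how does it affect other tasks?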

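Key Results

Swallow achieves the best performance on the evaluated tasks among Japanese models developed in Japan (as of December 2023).
After continual pre-training, Japanese performance improves by about 7 points on average over the corresponding Llama 2 variants.
Japanese QA tasks improve by up to about 75% and MGSM arithmetic reasoning by 36–63%, while English QA/AR degrades by 6–23%.
Japanese performance increases monotonically with the amount of Japanese training data up to about 100B tokens, with the largest gains early in training (the first 20B tokens).
Vocabulary expansion has only a minor effect across Japanese tasks, although automatic summarization degrades slightly (~5–15%).
Parallel corpora substantially improve translation (En-Ja 9–24%, Ja-En 14–51%) but do not consistently improve non-translation tasks.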

Figure 2: Joint distribution of $\mathtt{Llama\ 2}$ (x-axis) and Swallow (y-axis) scores (character F1, with 1.0 representing an exact match) for NIILC questions.
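
As a rough illustration of a character-level F1 score like the one named in the caption above, the sketch below computes a bag-of-characters F1 between a predicted and a reference answer. This is an assumption made for illustration: the evaluation tooling used in the paper may define character F1 differently, and the function name `char_f1` is hypothetical here. An exact match scores 1.0 under this definition, but so does any reordering of the same characters.

```python
from collections import Counter


def char_f1(prediction, reference):
    """Bag-of-characters F1 between two strings (order-insensitive)."""
    if not prediction and not reference:
        return 1.0  # two empty answers are treated as a match
    pred_counts = Counter(prediction)
    ref_counts = Counter(reference)
    # Multiset intersection: characters shared by both strings.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(char_f1("東京", "東京"))
    print(char_f1("東京都", "東京"))
```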


This review was generated by AI and reviewed by a human editor.