QUICK REVIEW

[Paper Review] Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities

Kazuki Fujii, Taishi Nakamura|arXiv (Cornell University)|Apr 27, 2024

Natural Language Processing Techniques | Cited by 5

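One-Line Summary

Swallow is a Llama 2-based LLM with enhanced Japanese capability. Continual pre-training on Japanese data, combined with vocabulary expansion, yields monotonic gains up to 100B tokens, and on Japanese tasks Swallow outperforms models trained from scratch on English and Japanese.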

ABSTRACT

Cross-lingual continual pre-training of large language models (LLMs) initially trained on English corpus allows us to leverage the vast amount of English language resources and reduce the pre-training cost. In this study, we constructed Swallow, an LLM with enhanced Japanese capability, by extending the vocabulary of Llama 2 to include Japanese characters and conducting continual pre-training on a large Japanese web corpus. Experimental results confirmed that the performance on Japanese tasks drastically improved through continual pre-training, and the performance monotonically increased with the amount of training data up to 100B tokens. Consequently, Swallow achieved superior performance compared to other LLMs that were trained from scratch in English and Japanese. An analysis of the effects of continual pre-training revealed that it was particularly effective for Japanese question answering tasks. Furthermore, to elucidate effective methodologies for cross-lingual continual pre-training from English to Japanese, we investigated the impact of vocabulary expansion and the effectiveness of incorporating parallel corpora. The results showed that the efficiency gained through vocabulary expansion had no negative impact on performance, except for the summarization task, and that the combined use of parallel corpora enhanced translation ability.

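Motivation and Objectives

Adapt an LLM trained on English to Japanese through continual pre-training, as an efficient route to cross-lingual adaptation.
Quantify how continual pre-training affects performance on Japanese and English tasks as a function of Japanese data volume and model size.
Investigate vocabulary expansion and parallel corpora as techniques for improving Japanese generation and translation.
Evaluate whether continual pre-training outperforms models trained from scratch on Japanese.
Provide practical guidelines for cross-lingual continual pre-training in a Japanese context.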

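Proposed Method

Extend the Llama 2 vocabulary with Japanese subwords and characters (vocabulary expansion, VE).
Perform continual pre-training on a 100B-token mixture of roughly 90% Japanese and 10% English text, using a replay strategy.
Evaluate on Japanese and English tasks in six categories, namely question answering (QA), reading comprehension (RC), automatic summarization (AS), arithmetic reasoning (AR), commonsense reasoning (CR), and machine translation (MT), using llm-jp-eval and the LM Evaluation Harness.
Compare Swallow (7B/13B/70B) with the base Llama 2 variants and with Japanese models trained from scratch.
Analyze the effect of VE and parallel corpora on task performance and translation ability.
Train with Flash Attention 2, a cosine learning-rate schedule with warmup, and the AdamW optimizer.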
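The 90/10 data mixture and the warmup-plus-cosine learning-rate schedule listed above can be sketched in a few lines. This is a minimal illustration, not the authors' training code: the function names are hypothetical, and every numeric value other than the 100B-token total and the 90/10 split (peak and minimum learning rate, warmup length, step count) is an assumption chosen for the example.

```python
import math


def token_budget(total_tokens: int, japanese_share: float = 0.9) -> dict:
    """Split a token budget between Japanese text and English text.

    The 0.9 default mirrors the roughly 90/10 mixture described in the review.
    """
    japanese = round(total_tokens * japanese_share)
    return {"ja": japanese, "en": total_tokens - japanese}


def lr_at_step(step: int, total_steps: int, warmup_steps: int,
               peak_lr: float, min_lr: float) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # (step + 1) so that the very first step already has a non-zero rate.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Calling `token_budget(100_000_000_000)` gives the Japanese and English token counts for the full run, and `lr_at_step` can be evaluated at a few steps to inspect the shape of the schedule before handing it to an optimizer such as AdamW.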

Figure 1: Relative change in performance of Swallow compared to $\mathtt{Llama\ 2}$. Japanese tasks (left, see Table 2 for task details) improved by up to approximately 70%.
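A relative change of this kind is usually computed against the baseline score, which is different from an absolute difference in points. The sketch below assumes that reading of Figure 1; the function name and the example scores are hypothetical and are not taken from the paper.

```python
def relative_change(baseline: float, adapted: float) -> float:
    """Percentage change of the adapted model's score relative to the baseline."""
    if baseline == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (adapted - baseline) / baseline * 100.0


# Hypothetical scores on a 0-1 scale, for illustration only.
llama2_score = 0.40
swallow_score = 0.68
change = relative_change(llama2_score, swallow_score)
```

With these made-up scores the absolute gain is 0.28, while the relative change is roughly 70%. A modest gain in points can therefore correspond to a large relative change when the baseline score is low.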

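Experimental Results

Research Questions

RQ1: Does continual pre-training from English to Japanese improve Japanese task performance regardless of model size?
RQ2: How does the amount of Japanese data used in continual pre-training affect performance, and is the relationship monotonic?
RQ3: How does vocabulary expansion affect performance and efficiency?
RQ4: Does incorporating a parallel Japanese-English corpus improve translation ability, and how does it affect other tasks?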

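Key Findings

As of December 2023, Swallow achieves the best performance across the evaluated tasks among Japanese models developed in Japan.
After continual pre-training, average Japanese performance is about 7 points higher than that of the corresponding Llama 2 variants.
Japanese QA improves by up to about 75% and arithmetic reasoning on MGSM by 36–63%, while English QA and AR decline by 6–23%.
Performance improves monotonically as Japanese training data grows to about 100B tokens, with the largest gains in the first 20B tokens.
Vocabulary expansion has little effect on Japanese tasks overall, but automatic summarization degrades by about 5–15%.
Parallel corpora substantially improve translation (En-Ja by 9–24%, Ja-En by 14–51%) in the mixed and two-stage settings, with no consistent improvement on non-translation tasks.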

Figure 2: Joint distribution of $\mathtt{Llama\ 2}$ (x-axis) and Swallow (y-axis) scores (character F1, with 1.0 representing an exact match) for NIILC questions.
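To make the character F1 score in Figure 2 concrete, here is one common way to define it: the harmonic mean of character-level precision and recall over the multiset of characters shared by the prediction and the reference. This is an assumption for illustration. The evaluation tooling used in the paper may compute the metric differently, and this order-insensitive version also assigns 1.0 to any rearrangement of the reference characters, not only to an exact match.

```python
from collections import Counter


def char_f1(prediction: str, reference: str) -> float:
    """Character-level F1 based on bag-of-characters overlap (illustrative)."""
    if not prediction or not reference:
        # Two empty strings count as a match; one empty string does not.
        return float(prediction == reference)
    # Multiset intersection counts each shared character at most as often
    # as it appears in both strings.
    overlap = sum((Counter(prediction) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction of "東京" against a reference of "東京都" has perfect precision but recall of two thirds, so it scores below 1.0 without being counted as entirely wrong.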

This review was generated by AI and checked by a human editor.