QUICK REVIEW

[Paper Review] Tamil-Llama: A New Tamil Language Model Based on Llama 2

Anitha Balachandran|arXiv (Cornell University)|Nov 10, 2023

Topic Modeling | Citations: 8

One-line summary

Extends LLaMA 2 with 16k Tamil tokens, uses LoRA for efficient training, and releases Tamil-focused Alpaca/OpenOrca instruction data to improve Tamil generation and understanding.

ABSTRACT

Language modeling has witnessed remarkable advancements in recent years, with Large Language Models (LLMs) like ChatGPT setting unparalleled benchmarks in human-like text generation. However, a prevailing limitation is the underrepresentation of languages like Tamil in these cutting-edge models, leading to suboptimal performance in diverse linguistic contexts. This paper addresses this lacuna, enhancing the open-source LLaMA model with an addition of 16,000 Tamil tokens, aiming to achieve superior text generation and comprehension in the Tamil language. We strategically employ the LoRA methodology for efficient model training on a comprehensive Tamil corpus, ensuring computational feasibility and model robustness. Moreover, we introduce a Tamil-translated version of the Alpaca dataset and a subset of the OpenOrca dataset tailored for instruction fine-tuning. Our results showcase significant performance improvements in Tamil text generation, with potential implications for the broader landscape of LLMs in Indian languages. We further underscore our commitment to open research by making our models, datasets, and code publicly accessible, fostering further innovations in language modeling.

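Motivation and objectives

Add Tamil tokens to the LLaMA 2 vocabulary to address the underrepresentation of Tamil in open-source LLMs.
Train the Tamil-LLaMA models efficiently with LoRA on a Tamil corpus.
Create Tamil-translated Alpaca and OpenOrca instruction datasets and use them for Tamil fine-tuning.
Evaluate Tamil-LLaMA on instruction following, reasoning, translation, and natural language understanding (NLU) tasks, and show improvements over baseline models.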

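Proposed method

Extend the LLaMA 2 vocabulary with 16,000 Tamil tokens obtained from a Tamil SentencePiece tokenizer.
Combine the original 32,000-token vocabulary with the 16,000 Tamil tokens to form a 48,000-token vocabulary.
Pretrain with a causal language modeling objective on a Tamil corpus, using fp16 and LoRA adapters (6–12 GB, depending on the setup).
Fine-tune the instruction-following models with LoRA in fp16 on the translated Alpaca and OpenOrca datasets together with a dataset derived from Tamil Wikipedia.
Evaluate on more than 120 Tamil instruction prompts using GPT-4-based scoring, supplemented by manual review.
Compare the 7B and 13B Tamil-LLaMA models against gpt-3.5-turbo across multiple tasks.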
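
The vocabulary extension and the LoRA step above can be illustrated with a minimal sketch. It is not the authors' code: the token strings are toy placeholders, `merge_vocab` and `lora_adapter_params` are hypothetical helpers, and the rank of 64 is an assumption chosen only for illustration. The 48,000 total holds only when the added tokens do not overlap with the base vocabulary, which the toy data guarantees by construction.

```python
def merge_vocab(base_tokens, new_tokens):
    """Append tokens not already in the base vocabulary, keeping base ids stable."""
    merged = list(base_tokens)
    seen = set(merged)
    for tok in new_tokens:
        if tok not in seen:
            merged.append(tok)
            seen.add(tok)
    return merged


def lora_adapter_params(d_in, d_out, rank):
    """Parameter count of one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Toy stand-ins for the real vocabularies (placeholder strings, not real tokens).
base_vocab = [f"<base_{i}>" for i in range(32_000)]
tamil_vocab = [f"<ta_{i}>" for i in range(16_000)]
merged_vocab = merge_vocab(base_vocab, tamil_vocab)

HIDDEN_SIZE = 4096   # hidden size of LLaMA 2 7B
ASSUMED_RANK = 64    # hypothetical LoRA rank, not taken from this review

# Every added token needs a new row in the embedding matrix.
new_embedding_rows = len(merged_vocab) - len(base_vocab)
new_embedding_params = new_embedding_rows * HIDDEN_SIZE

# Compare one square projection matrix with its LoRA adapter.
full_matrix_params = HIDDEN_SIZE * HIDDEN_SIZE
adapter_params = lora_adapter_params(HIDDEN_SIZE, HIDDEN_SIZE, ASSUMED_RANK)

print(len(merged_vocab))                    # 48000
print(new_embedding_params)                 # 65536000
print(adapter_params / full_matrix_params)  # 0.03125
```

Under these assumptions, the adapter for a single 4096 x 4096 projection trains about 3% of the parameters of the full matrix, while the newly added embedding rows have no pretrained values and must be learned from the Tamil corpus.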

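Experimental results

Research questions

RQ1: Does adding 16,000 Tamil tokens to LLaMA 2 substantially improve Tamil generation and understanding?
RQ2: Do LoRA-based pretraining and fine-tuning yield efficient, robust Tamil-LLaMA models suited to instruction-following tasks?
RQ3: Do the Tamil-translated Alpaca and OpenOrca datasets improve Tamil instruction-tuning outcomes relative to baseline models?
RQ4: How do the Tamil-LLaMA models fare on Tamil NLU and translation benchmarks compared with English-centric LLaMA variants?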

Key findings

Task Type	Tamil-LLaMA-7B	Tamil-LLaMA-13B	gpt-3.5-turbo	Notes
Question Answering	77.00	75.33	54.33	GPT-4 rated sample scores (Table 3)
Open-ended QA	84.47	85.26	58.68	GPT-4 rated sample scores (Table 3)
Reasoning	47.50	64.25	63.50	GPT-4 rated sample scores (Table 3)
Literature	45.50	40.00	71.00	GPT-4 rated sample scores (Table 3)
Entertainment	43.33	50.00	60.00	GPT-4 rated sample scores (Table 3)
Creative Writing	92.50	95.62	59.69	GPT-4 rated sample scores (Table 3)
Translation	60.56	66.67	92.78	GPT-4 rated sample scores (Table 3)
Coding	63.57	76.07	57.14	GPT-4 rated sample scores (Table 3)
Ethics	23.75	57.50	40.00	GPT-4 rated sample scores (Table 3)
Overall	63.83	71.17	61.33	GPT-4 rated overall (Table 3)
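
As a reading aid for the table, the short sketch below tallies which model scores highest in each task category and computes an unweighted mean per model. The scores are copied from the table; the variable names and the choice of an unweighted mean are illustrative assumptions, not part of the paper's evaluation procedure.

```python
from collections import Counter

MODELS = ("Tamil-LLaMA-7B", "Tamil-LLaMA-13B", "gpt-3.5-turbo")

# Per-task GPT-4 ratings from the table (the Overall row is excluded).
SCORES = {
    "Question Answering": (77.00, 75.33, 54.33),
    "Open-ended QA": (84.47, 85.26, 58.68),
    "Reasoning": (47.50, 64.25, 63.50),
    "Literature": (45.50, 40.00, 71.00),
    "Entertainment": (43.33, 50.00, 60.00),
    "Creative Writing": (92.50, 95.62, 59.69),
    "Translation": (60.56, 66.67, 92.78),
    "Coding": (63.57, 76.07, 57.14),
    "Ethics": (23.75, 57.50, 40.00),
}


def best_model(row):
    """Return the name of the model with the highest score in one table row."""
    best_index = max(range(len(MODELS)), key=lambda i: row[i])
    return MODELS[best_index]


winners = {task: best_model(row) for task, row in SCORES.items()}
wins = Counter(winners.values())

# Unweighted mean over the nine task categories, per model.
unweighted_mean = {
    model: sum(row[i] for row in SCORES.values()) / len(SCORES)
    for i, model in enumerate(MODELS)
}

for task, model in winners.items():
    print(f"{task}: {model}")
print(dict(wins))
print({model: round(mean, 2) for model, mean in unweighted_mean.items()})
```

On these numbers, Tamil-LLaMA-13B leads in five categories, gpt-3.5-turbo in three (Literature, Entertainment, Translation), and Tamil-LLaMA-7B in one (Question Answering). The unweighted means do not reproduce the reported Overall row; one possible explanation is that the categories contain different numbers of prompts, which this review does not state.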

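The Tamil-LLaMA models outperform the baseline LLaMA 2 on Tamil instruction-following tasks as rated by GPT-4.
In the GPT-4-based evaluation, Tamil-LLaMA-7B achieves a higher overall score than gpt-3.5-turbo (63.83 vs. 61.33).
Tamil-LLaMA-13B reaches an overall GPT-4 score of 71.17, exceeding gpt-3.5-turbo (61.33).
On NLU benchmarks, Tamil-LLaMA substantially outperforms the original LLaMA on IndicSentiment (81.3% versus a chance-level 50.5%) and IndicGLUE (80.12%).
On translation, Tamil-to-English performance is strong: Tamil-LLaMA surpasses the original LLaMA 2 70B at Tamil translation and approaches gpt-3.5-turbo.
Code generation and reasoning improve relative to Tamil output from larger models, but mathematical reasoning remains a challenge.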

This review was generated by AI and checked by a human editor.