QUICK REVIEW

[Paper Review] AfroBench: How Good are Large Language Models on African Languages?

Jessica Ojo, Ogundepo, Odunayo|arXiv (Cornell University)|Nov 14, 2023

Topic Modeling | Citations: 8

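One-Line Summary

This paper evaluates GPT-4, mT0, and LLaMa 2 on five NLP tasks across 30 African languages, revealing a large performance gap relative to high-resource languages as well as task-dependent strengths and weaknesses.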

ABSTRACT

Large-scale multilingual evaluations, such as MEGA, often include only a handful of African languages due to the scarcity of high-quality evaluation data and the limited discoverability of existing African datasets. This lack of representation hinders comprehensive LLM evaluation across a diverse range of languages and tasks. To address these challenges, we introduce AfroBench -- a multi-task benchmark for evaluating the performance of LLMs across 64 African languages, 15 tasks and 22 datasets. AfroBench consists of nine natural language understanding datasets, six text generation datasets, six knowledge and question answering tasks, and one mathematical reasoning task. We present results comparing the performance of prompting LLMs to fine-tuned baselines based on BERT and T5-style models. Our results suggest large gaps in performance between high-resource languages, such as English, and African languages across most tasks; but performance also varies based on the availability of monolingual data resources. Our findings confirm that performance on African languages continues to remain a hurdle for current LLMs, underscoring the need for additional efforts to close this gap. https://mcgill-nlp.github.io/AfroBench/

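Motivation and Objectives

Evaluate the performance of popular large language models on African languages across multiple NLP tasks.
Compare LLM prompting results against fully supervised baselines and high-resource language benchmarks.
Identify task-dependent strengths and weaknesses and the linguistic factors that affect performance on African languages.
Highlight implications for the representation of African languages in LLM development.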

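Proposed Method

Evaluate three LLMs (mT0, LLaMa 2, and GPT-4) with zero-shot prompting on 30 African languages across 5 tasks.
Use a zero-shot cross-lingual setup: the input text is in the target African language, while the prompt is written in English.
Compare LLM results against state-of-the-art supervised baselines and against performance on high-resource languages.
Analyze results by task (classification, QA, NER, MT) and by language family and region.
Use the MasakhaNEWS, AfriSenti, MasakhaNER, AfriQA, and MAFAND-MT datasets for evaluation.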
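
The zero-shot cross-lingual setup described above (English instruction, target-language input, no in-context examples) can be sketched as follows. This is a minimal illustration, not the paper's actual prompt: the template wording, the label set, and the helper names `build_zero_shot_prompt` and `parse_label` are all assumptions made for this example.

```python
from typing import Optional

# Hypothetical label set for a news topic classification task (not from the paper).
LABELS = ["business", "health", "politics", "sports"]


def build_zero_shot_prompt(text: str, labels: list[str]) -> str:
    """Wrap target-language text in an English instruction, with no examples."""
    options = ", ".join(labels)
    return (
        f"Classify the following news article into one of these topics: {options}.\n"
        f"Article: {text}\n"
        "Topic:"
    )


def parse_label(completion: str, labels: list[str]) -> Optional[str]:
    """Return the label mentioned earliest in the model output, or None if no label appears."""
    lowered = completion.lower()
    hits = [(lowered.find(label), label) for label in labels if label in lowered]
    return min(hits)[1] if hits else None


# The article text stays in the target language; only the instruction is in English.
prompt = build_zero_shot_prompt("Habari za michezo", LABELS)
```

The prompt string would then be sent to whichever model is being evaluated, and `parse_label` applied to the returned text. Mapping free-form output back to a fixed label set is one simple choice; outputs that mention no label come back as `None` and would need to be scored as errors.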

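Experimental Results

Research Questions

RQ1: How do GPT-4, mT0, and LLaMa 2 perform on African languages across the five NLP tasks?
RQ2: How large is the performance gap between African languages and high-resource languages for these LLMs?
RQ3: On which tasks does LLM performance degrade most, and why?
RQ4: Which model shows the strongest cross-lingual transfer or cross-lingual QA ability on African languages?
RQ5: Which factors (prompting, multilingual fine-tuning, pretraining data) explain the observed performance differences?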

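Key Findings

GPT-4 reaches roughly 80% of SOTA performance on news topic classification and sentiment classification.
GPT-4 performs worse on translation/generation (MT) than on classification tasks.
mT0 performs best on cross-lingual QA, outperforming supervised baselines such as fine-tuned mT5.
mT0-13B-MT benefits from multilingual prompts and approaches competitive MT results.
LLaMa 2 shows the lowest performance among the evaluated models, owing to its English-centric pretraining.
Across all tasks, every model falls short of high-resource language performance, with especially large gaps on MT and QA.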
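
A statement such as "roughly 80% of SOTA" is a ratio between a prompted model's score and a supervised baseline's score on the same metric. The sketch below shows that arithmetic, along with the absolute gap. The function names and the scores in the example are placeholders chosen for illustration, not results from the paper.

```python
def relative_to_baseline(llm_score: float, baseline_score: float) -> float:
    """Fraction of the supervised baseline's score that the prompted LLM reaches."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return llm_score / baseline_score


def absolute_gap(llm_score: float, baseline_score: float) -> float:
    """Baseline score minus LLM score, in the metric's own units."""
    return baseline_score - llm_score


# Placeholder scores (NOT from the paper): an LLM at 72.0 vs. a baseline at 90.0.
ratio = relative_to_baseline(72.0, 90.0)
gap = absolute_gap(72.0, 90.0)
```

With these placeholder inputs the ratio is 0.8 and the gap is 18.0 points. The two numbers answer different questions: the ratio is easier to compare across tasks with different metrics, while the gap stays in the units of a single metric.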

This review was generated by AI and checked by a human editor.