QUICK REVIEW

[Paper Review] MEGA: Multilingual Evaluation of Generative AI

Kabir Ahuja, Harshita Diddee|arXiv (Cornell University)|Mar 22, 2023

Topic Modeling | Cited by 14

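One-Sentence Summary

The MEGA benchmark evaluates generative LLMs on 16 NLP datasets spanning 70 languages, comparing GPT-3.5, GPT-4, and BLOOMZ against fine-tuned baselines to assess multilingual capability and prompting strategies.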

ABSTRACT

Generative AI models have shown impressive performance on many Natural Language Processing tasks such as language understanding, reasoning, and language generation. An important question being asked by the AI community today is about the capabilities and limits of these models, and it is clear that evaluating generative AI is very challenging. Most studies on generative LLMs have been restricted to English and it is unclear how capable these models are at understanding and generating text in other languages. We present the first comprehensive benchmarking of generative LLMs - MEGA, which evaluates models on standard NLP benchmarks, covering 16 NLP datasets across 70 typologically diverse languages. We compare the performance of generative LLMs including Chat-GPT and GPT-4 to State of the Art (SOTA) non-autoregressive models on these tasks to determine how well generative models perform compared to the previous generation of LLMs. We present a thorough analysis of the performance of models across languages and tasks and discuss challenges in improving the performance of generative LLMs on low-resource languages. We create a framework for evaluating generative LLMs in the multilingual setting and provide directions for future progress in the field.

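Motivation and Objectives

Assess how large language models perform on multilingual NLP tasks relative to fine-tuned SOTA models.
Identify the languages and tasks on which generative LLMs perform well or struggle, with a focus on low-resource languages.
Analyze prompting strategies (monolingual, zero-shot cross-lingual, translate-test) and their effect on multilingual performance.
Examine factors such as tokenizer quality and pretraining data, and discuss test-data contamination in multilingual evaluation.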

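Proposed Method

The benchmark covers 16 NLP datasets across 70 languages and five task families (classification, question answering, sequence labeling, NLG, and responsible AI).
Evaluates OpenAI GPT-3.5 (text-davinci-003, gpt-3.5-turbo) and GPT-4 (gpt-4-32k), along with BLOOMZ as a prompt-based baseline and several fine-tuned baselines (mBERT, mT5-base, XLM-R Large, TuLRv6 XXL, MuRIL, and others).
Builds each prompt from five components: instruction, in-context examples, template, verbalizer, and test input.
Compares three prompting strategies (monolingual, zero-shot cross-lingual, and translate-test); English templates based on PromptSource are used for consistency.
Tunes prompts on English validation data and applies the selected prompt to every language; the number of few-shot examples is fixed (8 for most tasks, 4 for long-context tasks).
Analyzes tokenizer fertility and pretraining data size as factors in multilingual performance, and assesses the risk of test-data contamination for GPT-4.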
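
The prompt construction and the three prompting strategies can be illustrated with a minimal Python sketch. This is not the MEGA codebase: the instruction text, template, verbalizer, and the `select_inputs`, `build_prompt`, and `translate` names are all hypothetical, and the mapping of each strategy to its few-shot pool and test input follows the usual description of these strategies rather than anything spelled out in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Example:
    text: str
    label: int


# Hypothetical prompt pieces for a binary sentiment task (not from the paper).
INSTRUCTION = "Classify the sentiment of the review as positive or negative."
TEMPLATE = "Review: {text}\nSentiment: {answer}"
VERBALIZER: Dict[int, str] = {0: "negative", 1: "positive"}


def select_inputs(
    strategy: str,
    target_pool: List[Example],
    english_pool: List[Example],
    test_input: str,
    translate: Callable[[str], str],
) -> Tuple[List[Example], str]:
    """Pick the few-shot pool and the test input for a prompting strategy.

    Assumed semantics:
      monolingual             -> target-language examples, target-language input
      zero_shot_cross_lingual -> English examples, target-language input
      translate_test          -> English examples, input translated to English
    """
    if strategy == "monolingual":
        return target_pool, test_input
    if strategy == "zero_shot_cross_lingual":
        return english_pool, test_input
    if strategy == "translate_test":
        return english_pool, translate(test_input)
    raise ValueError(f"unknown strategy: {strategy}")


def build_prompt(few_shot: List[Example], test_input: str, k: int = 8) -> str:
    """Assemble instruction, up to k in-context examples, and the test input."""
    parts = [INSTRUCTION]
    for ex in few_shot[:k]:
        # The verbalizer maps the integer label to the word the model should emit.
        parts.append(TEMPLATE.format(text=ex.text, answer=VERBALIZER[ex.label]))
    # The test input uses the same template with the answer slot left open.
    parts.append(TEMPLATE.format(text=test_input, answer="").rstrip())
    return "\n\n".join(parts)
```

The default `k=8` mirrors the few-shot count reported for most tasks; a long-context task would pass `k=4`. Keeping strategy selection separate from prompt assembly means the same English template can be reused across all three strategies.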

Figure (a): Tasks and datasets included in MEGA.

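Experimental Results

Research Questions

RQ1: How do LLMs perform on multilingual benchmarks relative to fine-tuned SOTA models across languages and tasks?
RQ2: On which languages and language families do generative LLMs perform best or worst, and why?
RQ3: Which prompting strategies give the best multilingual performance, and how do they vary across tasks and languages?
RQ4: To what extent do tokenizer quality and pretraining data size explain the performance gaps observed across languages?
RQ5: How does the risk of test-data contamination affect multilingual evaluation results and their interpretation?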

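Key Findings

LLMs generally lag behind fine-tuned SOTA models on most tasks, especially for non-English languages, although GPT-4 narrows the gap in some cases.
Translate-test prompting often yields substantial gains for low-resource and non-Latin-script languages, sometimes even exceeding monolingual prompting; gains are smaller for high-resource languages.
Tokenizer fertility (the number of subword tokens produced per word) is negatively correlated with performance on some multilingual tasks, suggesting that poor tokenization degrades results.
Per-language pretraining data size is positively correlated with performance on multilingual tasks (e.g., PAWS-X, XNLI, XCOPA, XQuAD); tokenization quality and data size together help explain multilingual performance.
GPT-4 shows strong improvements over GPT-3.5 on many datasets, but gaps remain for languages with limited pretraining data and complex scripts.
Prompt design choices (explanations, number of few-shot examples) affect results unevenly; some choices, such as explanations, may have almost no effect on XCOPA in some settings.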
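
The fertility and correlation findings can be made concrete with a short Python sketch. It assumes fertility is measured as the average number of subword tokens per whitespace-separated word, which is only a rough proxy for languages written without spaces, and it uses a plain Pearson coefficient; the summary does not state which correlation measure the paper uses. The `fertility` and `pearson` helpers and the `tokenize` callable are hypothetical names, not part of any MEGA tooling.

```python
import math
from typing import Callable, List, Sequence


def fertility(sentences: Sequence[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    n_words = 0
    n_tokens = 0
    for sentence in sentences:
        for word in sentence.split():
            n_words += 1
            n_tokens += len(tokenize(word))
    if n_words == 0:
        raise ValueError("no words to tokenize")
    return n_tokens / n_words


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

To reproduce the kind of analysis described above, one would compute `fertility` once per language with the model's own tokenizer, pair each value with that language's task score, and pass the two lists to `pearson`; a negative coefficient would match the reported trend.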


This review was generated by AI and reviewed by human editors.