QUICK REVIEW

[Paper Review] XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

Junjie Hu, Sebastian Ruder|arXiv (Cornell University)|Mar 24, 2020

Topic: Topic Modeling | References: 59 | Cited by: 299

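One-Sentence Summary

XTREME introduces a broad zero-shot cross-lingual benchmark covering 40 languages and 9 tasks for evaluating multilingual representations and transfer learning. It reveals a sizable cross-lingual gap, particularly on syntactic and sentence retrieval tasks.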

ABSTRACT

Much recent progress in applications of machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite an increasing interest in multilingual models, a benchmark that enables the comprehensive evaluation of such methods on a diverse range of languages and tasks is still missing. To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models, particularly on syntactic and sentence retrieval tasks. There is also a wide spread of results across languages. We release the benchmark to encourage research on cross-lingual learning methods that transfer linguistic knowledge across a diverse and representative set of languages and tasks.

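Motivation and Objectives

Motivate the need for a comprehensive cross-lingual evaluation benchmark that goes beyond English-centric tasks.
Provide a diverse, typologically varied set of languages and tasks for evaluating cross-lingual transfer.
Promote standardized evaluation and baselines to advance multilingual representation learning.
Analyze the limitations of current state-of-the-art cross-lingual models across languages and tasks.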

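Proposed Method

Define the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, covering 40 languages and 9 tasks.
Adopt zero-shot cross-lingual transfer: models are trained on English data only and tested on the target languages.
Assemble tasks spanning classification, structured prediction, question answering, and sentence retrieval, so that transfer is tested at several levels of linguistic meaning.
Provide pseudo test sets (machine-translated from English) as diagnostics that cover all languages and enable broader analysis.
Evaluate strong baselines (mBERT, XLM, XLM-R, MMTE) and translation-based approaches, and release the code and a leaderboard.
Analyze how performance correlates with pretraining data size, language family, and script to understand transfer dynamics.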
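
The zero-shot protocol above is usually summarized with a single number, the cross-lingual transfer gap. The following is a minimal sketch, assuming the gap is taken as the English score minus the mean score over all other languages. The function name `transfer_gap` and the scores in `toy_scores` are hypothetical illustrations, not the benchmark's code or results from the paper.

```python
from statistics import mean


def transfer_gap(scores, source_lang="en"):
    """Return the source-language score minus the mean target-language score.

    `scores` maps a language code to a task metric (for example accuracy or F1)
    obtained by a model fine-tuned on the source language only.
    """
    if source_lang not in scores:
        raise ValueError(f"missing score for source language {source_lang!r}")
    targets = [s for lang, s in scores.items() if lang != source_lang]
    if not targets:
        raise ValueError("need at least one target language")
    return scores[source_lang] - mean(targets)


# Hypothetical per-language accuracies, chosen only to illustrate the arithmetic.
toy_scores = {"en": 85.0, "de": 80.0, "sw": 70.0, "th": 60.0}
gap = transfer_gap(toy_scores)
print(f"cross-lingual transfer gap: {gap:.1f}")
```

A smaller gap means that more of the English performance carries over to the target languages. Averaging over languages hides the spread between them, so per-language scores are worth inspecting alongside the gap.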

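Experimental Results

Research Questions

RQ1: In the zero-shot setting, how well do current multilingual representations transfer across 9 tasks to 40 typologically diverse languages?
RQ2: What are the main cross-lingual transfer gaps, and how do they vary with task, language family, or script?
RQ3: Do translation-based augmentation or in-language training data improve cross-lingual transfer relative to zero-shot?
RQ4: How does model performance correlate with pretraining data size and with language characteristics (family, script)?
RQ5: Which diagnostics can expose the limitations of state-of-the-art cross-lingual models across diverse languages?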

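Key Findings

Zero-shot transfer models approach human-level performance on English but drop markedly on other languages, especially on syntactic and sentence retrieval tasks.
XLM-R Large generally outperforms mBERT and the other baselines in zero-shot transfer, with notable gains on XQuAD and MLQA but only limited gains on structured prediction tasks.
Translation-based baselines (translate-train, translate-test) provide substantial improvements and generally narrow the cross-lingual transfer gap across tasks.
In-language training data can outperform zero-shot transfer on several tasks, but when English data is plentiful, zero-shot methods remain strongly competitive on complex question answering tasks.
For many languages, cross-lingual transfer performance correlates with pretraining data size; the correlation is stronger for Indo-European languages and weaker for the Sino-Tibetan, Japonic, Koreanic, and Niger-Congo families.
Significant transfer gaps remain across languages and tasks, showing that cross-lingual transfer methods still have room to improve.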
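
The correlation between performance and pretraining data size can be illustrated with a Pearson coefficient computed over per-language pairs. This is a minimal sketch under stated assumptions: the helper `pearson`, the use of log10 of the data size, and every number in `toy_sizes` and `toy_scores` are hypothetical and are not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical pretraining data sizes (number of articles) and task scores.
toy_sizes = {"en": 1_000_000, "de": 100_000, "sw": 10_000, "yo": 1_000}
toy_scores = {"en": 80.0, "de": 70.0, "sw": 65.0, "yo": 50.0}

langs = sorted(toy_sizes)
log_sizes = [math.log10(toy_sizes[lang]) for lang in langs]  # log scale is an assumption
scores = [toy_scores[lang] for lang in langs]
r = pearson(log_sizes, scores)
print(f"Pearson r between log data size and score: {r:.3f}")
```

Restricting `langs` to one language family at a time would give a per-family coefficient, which is one way to compare families as in the finding above.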

This review was generated by AI and reviewed by a human editor.