QUICK REVIEW

[Paper Review] FlauBERT: Unsupervised Language Model Pre-training for French

Hang Le, Loïc Vial|arXiv (Cornell University)|Dec 11, 2019

Topic Modeling | References: 72 | Cited by: 61

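One-Sentence Summary

FlauBERT is a monolingual French Transformer language model, pre-trained on a large and diverse French text corpus, that achieves state-of-the-art results on multiple French NLP tasks and is released alongside the FLUE benchmark for reproducible evaluation.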

ABSTRACT

Language models have become a key step to achieve state-of-the-art results in many different Natural Language Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient way to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their contextualization at the sentence level. This has been widely demonstrated for English using contextualized representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified evaluation protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research community for further reproducible experiments in French NLP.

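Motivation and Objectives

Improve French NLP by exploiting large amounts of unlabeled French text to learn contextual representations.
Develop a monolingual, BERT-style French model that outperforms multilingual models across diverse tasks.
Provide a reproducible pipeline and a French NLP evaluation benchmark (FLUE).
Release several FlauBERT versions and compare them with CamemBERT and mBERT on the benchmark tasks.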

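Proposed Method

Pre-train two FlauBERT variants (base and large) with the masked language modeling (MLM) objective, without next sentence prediction (NSP), on a 71 GB French corpus compiled from 24 sub-corpora.
Use a 50K-subword vocabulary built with byte pair encoding (BPE), applying a basic French tokenizer before BPE.
Adopt a pre-norm Transformer and stochastic depth to stabilize the training of large models.
Train on substantial GPU resources (32 GPUs for base, 128 GPUs for large) with the Adam optimizer and a carefully tuned learning rate and warmup.
Compare FlauBERT with mBERT, CamemBERT, and XLM-R on a range of French NLP tasks.
Provide preprocessing and training scripts together with the unified FLUE benchmark to enable reproducible evaluation for French NLP.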
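
To make the MLM objective mentioned above more concrete, here is a minimal sketch of how one training example could be corrupted. It is not FlauBERT's actual code: the 15% selection rate and the 80/10/10 split (mask / random token / unchanged) follow the usual BERT convention and are assumptions here, and the `<mask>` symbol, the `mask_tokens` helper, and the toy vocabulary are hypothetical names used only for illustration.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, chosen for this sketch only


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Corrupt a token sequence for masked language modeling.

    Returns (corrupted, labels). labels[i] is the original token at
    positions selected for prediction and None everywhere else, so a
    training loss would only be computed on the selected positions.
    The 15% rate and 80/10/10 split are assumed BERT-style defaults.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with mask
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep unchanged
        else:
            labels.append(None)  # position not used in the loss
            corrupted.append(tok)
    return corrupted, labels


# Toy example: whitespace "tokens" stand in for BPE subwords.
tokens = "le chat dort sur le canapé".split()
vocab = sorted(set(tokens))
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
```

Because there is no NSP objective, each example is just a stream of tokens with no sentence-pair label. In a real pipeline the inputs would be BPE subword IDs rather than whitespace-split words.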

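Experimental Results

Research Questions

RQ1: Can a monolingual French Transformer model trained on a large, heterogeneous French corpus outperform multilingual models on French NLP tasks?
RQ2: How does model size (base vs. large) affect performance across diverse French NLP tasks?
RQ3: Can a monolingual French model achieve state-of-the-art results on a comprehensive French evaluation suite (FLUE) compared with existing French and multilingual models?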

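Key Findings

FlauBERT outperforms multilingual models such as mBERT on several French NLP tasks.
FlauBERT large generally gives the best results across tasks relative to the baseline models and is competitive with CamemBERT in several settings.
On parsing tasks, FlauBERT-based systems perform strongly, and ensemble settings bring further gains.
Despite being trained on less data in some configurations, FlauBERT is competitive with or better than CamemBERT on multiple tasks.
The unified FLUE benchmark is released to support reproducible evaluation of French NLP systems.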


This review was generated by AI and reviewed by human editors.