QUICK REVIEW

[Paper Review] ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Kevin B. Clark, Minh-Thang Luong|arXiv (Cornell University)|Mar 23, 2020

Topic Modeling | References: 48 | Cited by: 541

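One-Sentence Summary

ELECTRA introduces replaced token detection, a discriminative pre-training task in which a generator produces plausible token replacements and a discriminator learns to detect which tokens were replaced. Compared with MLM-based methods such as BERT, it achieves better downstream performance with substantially less compute.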

ABSTRACT

Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.

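Motivation and Objectives

Improve the pre-training efficiency and performance of Transformer encoders relative to masked language modeling (MLM) as used in BERT.
Develop a discriminative pre-training task that replaces tokens with generator samples instead of masking them.
Let the model learn from all input tokens rather than only the masked subset, to speed up convergence and improve representations.
Demonstrate scalability and efficiency for both small and large models on the GLUE and SQuAD benchmarks.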
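
The "all input tokens versus the masked subset" point can be made concrete with a small count. The sketch below is only illustrative: the helper name supervised_positions is hypothetical, and the 15% masking rate is an assumption (the rate commonly used by BERT), not a figure stated in this summary.

```python
def supervised_positions(seq_len, mask_fraction=0.15):
    """Count positions that contribute to the loss under each objective.

    mask_fraction=0.15 is an assumption for illustration (the masking
    rate commonly used by BERT); the summary does not state it.
    """
    # MLM: only the masked positions are predicted and scored.
    mlm_positions = round(seq_len * mask_fraction)
    # Replaced token detection: every position gets an original/replaced label.
    rtd_positions = seq_len
    return mlm_positions, rtd_positions


mlm, rtd = supervised_positions(seq_len=100)
print(f"MLM: {mlm} of 100 positions; replaced token detection: {rtd} of 100")
```

A binary label per position is a different kind of signal from a full vocabulary prediction, so this count explains where the extra supervision comes from but does not by itself quantify the efficiency gain.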

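Proposed Method

Proposes pre-training with two networks, a generator G and a discriminator D, both Transformer encoders.
Corrupts the input by replacing a subset of tokens with samples from G, producing a corrupted sequence.
Trains D to predict, for each token, whether it is the original token or a generator replacement (replaced token detection).
Trains G with maximum-likelihood masked language modeling to produce plausible replacements (not adversarially).
Uses a combined objective, L = E[MLM loss of G] + lambda * E[Disc loss of D], where the Disc loss is a binary classification over every token in the corrupted sequence.
Explores weight sharing between G and D (sharing embeddings, or tying all weights) and different generator sizes to balance compute against performance.
Evaluates on GLUE and SQuAD, comparing ELECTRA with BERT, XLNet, RoBERTa, and GPT under similar compute and data.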
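
The corruption step and the combined objective described above can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation: there are no Transformers here, the generator is a stand-in that samples uniformly from a tiny vocabulary, and the names corrupt, disc_loss, combined_loss, and toy_generator_sample are hypothetical, as are the values chosen for replace_prob and lam. Labelling a position as "replaced" only when the sampled token differs from the original is a choice made in this sketch.

```python
import math
import random


def corrupt(tokens, generator_sample, replace_prob, rng):
    """Replace a random subset of tokens with generator samples.

    Returns (corrupted_tokens, labels); labels[i] is 1 when the token at
    position i differs from the original and 0 otherwise.
    """
    corrupted, labels = [], []
    for i, original in enumerate(tokens):
        if rng.random() < replace_prob:
            token = generator_sample(tokens, i, rng)
        else:
            token = original
        corrupted.append(token)
        labels.append(1 if token != original else 0)
    return corrupted, labels


def disc_loss(probs_replaced, labels, eps=1e-12):
    """Binary cross-entropy averaged over every position of the sequence."""
    total = 0.0
    for p, y in zip(probs_replaced, labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)


def combined_loss(mlm_loss, d_loss, lam):
    """L = MLM loss of G + lambda * Disc loss of D."""
    return mlm_loss + lam * d_loss


def toy_generator_sample(tokens, position, rng):
    """Stand-in for G: samples uniformly from a tiny, made-up vocabulary."""
    vocab = ["the", "chef", "cooked", "ate", "meal"]
    return rng.choice(vocab)


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = ["the", "chef", "cooked", "the", "meal"]
    corrupted, labels = corrupt(sentence, toy_generator_sample, 0.3, rng)
    # Pretend the discriminator is maximally unsure at every position.
    probs = [0.5] * len(corrupted)
    print(corrupted, labels)
    print(combined_loss(mlm_loss=2.0, d_loss=disc_loss(probs, labels), lam=1.0))
```

In a real setup the MLM loss would come from G's predictions at the masked positions and probs from D's per-token output; here both are placeholders so that the data flow from corruption to labels to combined loss is visible.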

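Experimental Results

Research Questions

RQ1: Does learning from all input tokens through replaced token detection improve efficiency and performance relative to conventional MLM pre-training?
RQ2: How do generator size, weight-sharing strategy, and training algorithm affect ELECTRA's sample efficiency and downstream performance?
RQ3: Can ELECTRA match or exceed state-of-the-art models (RoBERTa, XLNet) with less pre-training compute?
RQ4: How does ELECTRA perform in the small-model setting and on the answerability task of SQuAD 2.0?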

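Key Findings

Given the same model size, data, and compute, ELECTRA substantially outperforms MLM-based methods such as BERT on GLUE and SQuAD.
ELECTRA-Small, trained on one GPU for 4 days, outperforms GPT and is competitive with larger models while requiring less compute and fewer parameters.
At large scale, ELECTRA-Large performs comparably to RoBERTa and XLNet while using less than 1/4 of their pre-training compute, and outperforms them when given a similar amount of compute.
Learning from all input tokens (the discriminator objective) is the main source of the efficiency and performance gains; training with a generator smaller than the discriminator improves results further.
Two-stage and adversarial training variants did not outperform the joint maximum-likelihood objective; maximum-likelihood training of the generator gives better downstream results.
Across model sizes, ELECTRA's gains grow as the model gets smaller, which indicates improved parameter efficiency.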

This review was generated by AI and reviewed by human editors.