QUICK REVIEW

[Paper Review] ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Kevin B. Clark, Minh-Thang Luong|arXiv (Cornell University)|Mar 23, 2020

Topic Modeling | References: 48 | Citations: 541

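One-line summary

ELECTRA introduces replaced token detection, a discriminative pre-training task in which a generator produces plausible token replacements and a discriminator detects which tokens were replaced. This improves downstream performance with far less compute than MLM-based methods such as BERT.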

ABSTRACT

Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.

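Motivation and objectives

Improve the efficiency and performance of Transformer encoder pre-training relative to masked language modeling (MLM) methods such as BERT.
Develop a discriminative pre-training task that uses replacements sampled from a generator instead of masking tokens.
Learn from all input tokens rather than only the masked subset, to speed up convergence and improve representations.
Demonstrate scalability and efficiency across small- and large-model regimes on the GLUE and SQuAD benchmarks.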

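Proposed method

Proposes pre-training with two networks, a generator G and a discriminator D, both based on Transformer encoders.
Forms a corrupted sequence by replacing a subset of the input tokens with samples from G.
Trains D to predict, for each token, whether it is the original or a generator replacement (replaced token detection).
Trains G with maximum-likelihood masked language modeling so that it produces plausible replacements (not adversarially).
Uses a combined objective: L = E[MLM loss of G] + lambda * E[Disc loss of D], where the Disc loss is a binary classification over each token of the corrupted sequence.
Explores weight sharing between G and D (sharing embeddings, sometimes tying all weights) and examines different generator sizes to balance compute and performance.
Evaluates on GLUE and SQuAD, comparing ELECTRA with BERT, XLNet, RoBERTa, and GPT under similar compute and data conditions.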
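The corruption step, the per-token labels, and the combined objective described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_generator` is a hypothetical stand-in for a trained masked language model, `VOCAB`, `mask_rate`, and the value passed as `lam` are placeholders, and the discriminator loss is averaged over positions for readability (a sum differs only by a constant factor). A sampled token that happens to equal the original is labeled as original here, which is an assumption about how such ties are handled.

```python
import math
import random

# Hypothetical toy vocabulary; a real generator samples from its MLM output distribution.
VOCAB = ["the", "chef", "cooked", "ate", "meal", "a"]


def toy_generator(tokens, position, rng):
    """Stand-in for sampling a replacement token from a small generator G."""
    return rng.choice(VOCAB)


def corrupt(tokens, generator_sample, mask_rate, rng):
    """Replace a random subset of positions with generator samples.

    Returns (corrupted, labels, positions). labels[i] is 1 if the token at
    position i differs from the original and 0 otherwise.
    """
    n_replace = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_replace))
    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = generator_sample(tokens, i, rng)
    # Assumption: a sample identical to the original counts as "original".
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels, positions


def discriminator_loss(probs_replaced, labels):
    """Binary cross-entropy over every position of the corrupted sequence."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(probs_replaced, labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)


def combined_loss(mlm_loss, disc_loss, lam):
    """L = (MLM loss of G) + lambda * (Disc loss of D)."""
    return mlm_loss + lam * disc_loss


tokens = ["the", "chef", "cooked", "the", "meal"]
corrupted, labels, positions = corrupt(tokens, toy_generator, 0.4, random.Random(0))
# An untrained discriminator that outputs 0.5 everywhere.
probs = [0.5] * len(tokens)
loss = combined_loss(mlm_loss=2.0, disc_loss=discriminator_loss(probs, labels), lam=1.0)
```

Every position receives a label, including those the generator never touched, so the discriminator loss has one term per input token. The `mlm_loss=2.0` and `lam=1.0` values in the usage lines are arbitrary placeholders chosen only to make the call runnable.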

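Experimental results

Research questions

RQ1: Does learning from all input tokens via replaced token detection improve efficiency and performance compared with conventional MLM pre-training?
RQ2: How do generator size, weight-sharing strategy, and training algorithm affect ELECTRA's sample efficiency and downstream performance?
RQ3: Can ELECTRA match or exceed state-of-the-art models such as RoBERTa and XLNet while using less pre-training compute?
RQ4: How does ELECTRA perform in the small-model regime and on the SQuAD 2.0 answerability task?

Key findings

Given the same model size, data, and compute, ELECTRA substantially outperforms MLM-based methods (e.g., BERT) on GLUE and SQuAD.
ELECTRA-Small, trained on one GPU for 4 days, outperforms GPT and is competitive with larger models while using far less compute and far fewer parameters.
At large scale, ELECTRA-Large performs comparably to RoBERTa and XLNet with less pre-training compute, and outperforms them when given a similar amount of compute.
Learning from all input tokens (the discriminator objective) is the main driver of the efficiency and performance gains, and training with a generator smaller than the discriminator improves results further.
Two-stage and adversarial training variants did not outperform the joint maximum-likelihood objective; maximum-likelihood training of the generator led to better downstream results.
ELECTRA's gains hold across model sizes and become more pronounced as models get smaller, indicating improved parameter efficiency.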
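The finding about learning from all input tokens can be made concrete with a small count. The sketch below assumes a sequence length of 128 and a 15% masking rate, both illustrative values not stated in this review, and compares how many positions contribute a loss term under each objective. It counts positions only and says nothing about how much is learned per position.

```python
def positions_with_loss(seq_len, mask_rate):
    """Count positions that contribute a loss term under each objective."""
    mlm = round(seq_len * mask_rate)  # MLM: only the masked subset
    rtd = seq_len                     # replaced token detection: every position
    return mlm, rtd


# Illustrative values (assumptions): 128 tokens, 15% masked.
mlm, rtd = positions_with_loss(seq_len=128, mask_rate=0.15)
print(mlm, rtd)  # prints "19 128"
```

Under these assumed values the discriminator receives a training signal at every one of the 128 positions, while the MLM objective is defined over only 19 of them.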


This review was generated by AI and checked by a human editor.