QUICK REVIEW

[Paper Review] MASS: Masked Sequence to Sequence Pre-training for Language Generation

Kaitao Song, Xu Tan | arXiv (Cornell University) | May 7, 2019

Natural Language Processing Techniques | 580 citations

TL;DR

MASS pre-trains an encoder–decoder model by predicting a masked fragment of a sentence, improving zero/low-resource language generation tasks such as NMT, text summarization, and conversational response generation, and achieving state-of-the-art unsupervised NMT BLEU scores.

ABSTRACT

Pre-training and fine-tuning, e.g., BERT, have achieved great success in language understanding by transferring knowledge from rich-resource pre-training task to the low/zero-resource downstream tasks. Inspired by the success of BERT, we propose MAsked Sequence to Sequence pre-training (MASS) for the encoder-decoder based language generation tasks. MASS adopts the encoder-decoder framework to reconstruct a sentence fragment given the remaining part of the sentence: its encoder takes a sentence with randomly masked fragment (several consecutive tokens) as input, and its decoder tries to predict this masked fragment. In this way, MASS can jointly train the encoder and decoder to develop the capability of representation extraction and language modeling. By further fine-tuning on a variety of zero/low-resource language generation tasks, including neural machine translation, text summarization and conversational response generation (3 tasks and totally 8 datasets), MASS achieves significant improvements over the baselines without pre-training or with other pre-training methods. Specially, we achieve the state-of-the-art accuracy (37.5 in terms of BLEU score) on the unsupervised English-French translation, even beating the early attention-based supervised model.

Motivation & Objective

Motivate pre-training for language generation tasks with encoder–decoder architectures.
Propose MASS to jointly pre-train encoder and decoder by reconstructing masked sentence fragments.
Show MASS improves zero/low-resource NMT, summarization, and conversational response generation over baselines.
Demonstrate MASS achieves state-of-the-art unsupervised NMT BLEU scores on multiple language pairs.

Proposed method

Model uses a Transformer encoder–decoder architecture.
Input is a sentence with a consecutive fragment masked by a special symbol; the decoder predicts the masked fragment conditioned on encoder representations.
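The length k of the masked fragment is a hyperparameter; MASS covers masked language modeling (BERT) and standard language modeling (GPT) as special cases, at k = 1 and at k equal to the sentence length, respectively.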
Within the masked fragment on the encoder side, 80% of tokens are replaced by [M], 10% by random tokens, and 10% are left unchanged.
During pre-training, MASS masks a span of consecutive tokens in the encoder input and masks the decoder input tokens that are unmasked in the encoder, which encourages the decoder to rely on the encoder's representations rather than on preceding target tokens.
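The masking scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `mass_example`, its parameters, and the full-length decoder layout with a one-position right shift (the usual teacher-forcing convention) are assumptions made for this example. When no vocabulary is passed, every fragment token becomes [M], which keeps the output deterministic; passing a `vocab` list enables the 80/10/10 replacement.

```python
import random

MASK = "[M]"

def mass_example(tokens, start, k, vocab=None, rng=None):
    """Build one illustrative MASS example by masking tokens[start:start + k].

    Hypothetical helper: names and output layout are assumptions of this sketch.
    Returns (encoder_input, decoder_input, targets).
    """
    rng = rng or random.Random(0)
    end = start + k
    if start < 0 or k < 1 or end > len(tokens):
        raise ValueError("fragment must lie inside the sentence")

    # Encoder side: only the fragment is corrupted, the rest stays visible.
    enc_in = list(tokens)
    for i in range(start, end):
        r = rng.random()
        if vocab is None or r < 0.8:
            enc_in[i] = MASK                 # replaced by [M] (80% with a vocab)
        elif r < 0.9:
            enc_in[i] = rng.choice(vocab)    # replaced by a random token (10%)
        # otherwise the token is left unchanged (10%)

    # Decoder side: tokens visible to the encoder are masked here, so the
    # decoder only sees earlier fragment tokens, shifted right by one position.
    dec_in = [
        tokens[i - 1] if start < i < end else MASK
        for i in range(len(tokens))
    ]

    # The decoder is trained to predict exactly the masked fragment.
    targets = list(tokens[start:end])
    return enc_in, dec_in, targets


sentence = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
enc_in, dec_in, targets = mass_example(sentence, start=2, k=4)
print(enc_in)
print(dec_in)
print(targets)
```

Setting k to the full sentence length masks the whole encoder input, so the decoder behaves like a standard left-to-right language model; setting k = 1 leaves a single masked token to predict, as in masked language modeling.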

Experimental results

Research questions

RQ1: Can MASS jointly pre-train encoder and decoder on unlabeled data to benefit language generation tasks?
RQ2: How does the masked fragment length k affect pre-training effectiveness and downstream task performance?
RQ3: Does MASS outperform existing pre-training approaches (e.g., BERT+LM, DAE, XLM) for encoder–decoder generation tasks under zero/low-resource settings?
RQ4: Is MASS effective across diverse generation tasks such as NMT, text summarization, and conversational response generation?

Key findings

MASS outperforms prior methods on unsupervised NMT across six translation directions, with en-fr BLEU of 37.50 and en-ro BLEU of 35.20 for the MASS 6-layer Transformer configuration.
In zero/low-resource NMT, MASS consistently surpasses baselines trained only on bilingual data and previous pre-training methods across all language pairs studied.
For text summarization, MASS improves ROUGE scores over baselines at multiple data scales, including a notable gain with as few as 10K training samples.
For conversational response generation, MASS yields lower perplexity than baselines on both 10K and 110K data settings.
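Ablation studies show that masking consecutive tokens (rather than random, non-consecutive tokens) and masking the decoder inputs that are visible to the encoder are both crucial for MASS effectiveness; MASS consistently beats the Discrete variant (discrete masking) and the Feed variant (all tokens fed to the decoder).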
In unsupervised NMT, MASS sets a new state of the art, outperforming the previous best by more than 4 BLEU points on English-French.


This review was created by AI and reviewed by human editors.