QUICK REVIEW

[Paper Review] PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization

Jingqing Zhang, Yao Zhao|arXiv (Cornell University)|Dec 18, 2019

Topic Modeling | References: 45 | Cited by: 978

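One-Sentence Summary

PEGASUS introduces Gap Sentences Generation (GSG) as a pre-training objective for Transformer encoder-decoder models, achieving state-of-the-art results on 12 abstractive summarization datasets and performing strongly in low-resource settings.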

ABSTRACT

Recent work pre-training Transformers with self-supervised objectives on large text corpora has shown great success when fine-tuned on downstream NLP tasks including text summarization. However, pre-training objectives tailored for abstractive text summarization have not been explored. Furthermore there is a lack of systematic evaluation across diverse domains. In this work, we propose pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective. In PEGASUS, important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar to an extractive summary. We evaluated our best PEGASUS model on 12 downstream summarization tasks spanning news, science, stories, instructions, emails, patents, and legislative bills. Experiments demonstrate it achieves state-of-the-art performance on all 12 downstream datasets measured by ROUGE scores. Our model also shows surprising performance on low-resource summarization, surpassing previous state-of-the-art results on 6 datasets with only 1000 examples. Finally we validated our results using human evaluation and show that our model summaries achieve human performance on multiple datasets.

Motivation and Objectives

Motivate abstractive summarization-specific pre-training objectives beyond general language modeling.
Develop a self-supervised objective that aligns pre-training with downstream summarization tasks.
Evaluate across diverse domains to assess generalization of the pre-training approach.
Demonstrate strong performance in low-resource fine-tuning settings.
Validate results with human evaluation, testing whether model summaries match or exceed human-written reference summaries on several datasets.

Proposed Method

Propose Gap Sentences Generation (GSG) where important sentences are masked and generated as a single output from the remaining text.
Compare several gap-sentence selection strategies (Lead, Random, Principal Ind-Orig/Ind-Uniq, Seq-Orig/Seq-Uniq).
Combine GSG with a Masked Language Model (MLM) objective in ablations, ultimately selecting GSG without MLM for the large model.
Pre-train Transformer encoder-decoder models on C4 and HugeNews corpora to learn to generate gap sentences from the rest of the document.
Fine-tune on 12 downstream abstractive summarization datasets spanning multiple domains; evaluate with ROUGE metrics.
Assess zero-shot and low-resource performance by fine-tuning on little or no supervised data (as few as 100–1000 examples).
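
The GSG data construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that the importance-based Ind-Orig strategy scores each sentence independently by unigram-overlap F1 (a simplified stand-in for ROUGE-1) against the rest of the document, that the number of gap sentences is the rounded product of the sentence count and the gap-sentence ratio, and that each selected sentence is replaced by a single mask token. The names `MASK_TOKEN`, `rouge1_f1`, `select_gap_sentences`, and `make_gsg_example` are hypothetical.

```python
import random
import re
from collections import Counter

MASK_TOKEN = "[MASK1]"  # assumed placeholder for a masked (gap) sentence


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two strings (simplified ROUGE-1)."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def select_gap_sentences(sentences, gsr=0.3, strategy="ind_orig", seed=0):
    """Return the sorted indices of the sentences to mask."""
    n = len(sentences)
    m = max(1, round(n * gsr))  # number of gap sentences (assumed rounding)
    if strategy == "lead":
        chosen = list(range(m))  # first m sentences
    elif strategy == "random":
        chosen = random.Random(seed).sample(range(n), m)
    elif strategy == "ind_orig":
        scored = []
        for i, sent in enumerate(sentences):
            # Score each sentence independently against the rest of the document.
            rest = " ".join(s for j, s in enumerate(sentences) if j != i)
            scored.append((rouge1_f1(sent, rest), i))
        # Highest score first; ties go to the earlier sentence.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        chosen = [i for _, i in scored[:m]]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(chosen)


def make_gsg_example(sentences, gsr=0.3, strategy="ind_orig"):
    """Build one (source, target) pair for gap-sentence generation."""
    gaps = set(select_gap_sentences(sentences, gsr, strategy))
    source = " ".join(MASK_TOKEN if i in gaps else s
                      for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(gaps))
    return source, target
```

Calling `make_gsg_example` on a list of sentences returns the masked document as the encoder input and the removed sentences, concatenated in document order, as the decoder target. Switching `strategy` between `"lead"`, `"random"`, and `"ind_orig"` reproduces the kind of comparison listed above on a toy scale.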

Experimental Results

Research Questions

RQ1: Can a pre-training objective tailored for abstractive summarization improve downstream ROUGE scores across diverse domains?
RQ2: How should gap sentences be selected (random, lead, or importance-based) to optimize downstream summarization performance?
RQ3: What is the impact of gap-sentence ratio and vocabulary choices on pre-training effectiveness?
RQ4: Does pre-training on domain-aligned corpora (HugeNews vs. C4) affect performance across different downstream tasks?
RQ5: How does PEGASUS perform in zero-shot and low-resource fine-tuning settings compared to baselines?

Key Findings

R1/R2/RL	Transformer_BASE	PEGASUS_BASE	Previous SOTA	PEGASUS_LARGE (C4)	PEGASUS_LARGE (HugeNews)
XSum	30.83/10.83/24.41	39.79/16.58/31.70	45.14/22.27/37.25	45.20/22.06/36.99	47.21/24.56/39.25
CNN/DailyMail	38.27/15.03/35.48	41.79/18.81/38.93	44.16/21.28/40.90	43.90/21.20/40.76	44.17/21.47/41.11
NEWSROOM	40.28/27.93/36.52	42.38/30.06/38.52	39.91/28.38/36.87	45.07/33.39/41.28	45.15/33.51/41.33
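
The R1/R2/RL triples in the table are ROUGE-1, ROUGE-2, and ROUGE-L scores. The sketch below shows one simplified way such a triple could be computed for a single candidate and reference summary, assuming F1 variants scaled to 0–100 and plain whitespace tokenization. Standard ROUGE tooling adds steps such as stemming and corpus-level aggregation, so this will not reproduce the reported numbers. The function names are hypothetical.

```python
from collections import Counter


def _ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(match, cand_total, ref_total):
    """F1 from a match count and the candidate/reference totals."""
    if match == 0:
        return 0.0
    precision, recall = match / cand_total, match / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """Clipped n-gram overlap F1 (simplified ROUGE-N)."""
    cand = _ngrams(candidate.split(), n)
    ref = _ngrams(reference.split(), n)
    match = sum((cand & ref).values())
    return _f1(match, sum(cand.values()), sum(ref.values()))


def rouge_l(candidate, reference):
    """Longest-common-subsequence F1 (simplified ROUGE-L)."""
    c, r = candidate.split(), reference.split()
    # Dynamic-programming table for the LCS length.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, c_tok in enumerate(c, 1):
        for j, r_tok in enumerate(r, 1):
            if c_tok == r_tok:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return _f1(dp[-1][-1], len(c), len(r))


def format_triple(candidate, reference):
    """Format scores in the table's R1/R2/RL style."""
    scores = (rouge_n(candidate, reference, 1),
              rouge_n(candidate, reference, 2),
              rouge_l(candidate, reference))
    return "/".join(f"{100 * s:.2f}" for s in scores)
```

For the pair "the cat sat on the mat" and "the cat lay on the mat", five of six unigrams and three of five bigrams match, and the longest common subsequence has length five, so `format_triple` returns `83.33/60.00/83.33`.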

GSG-based pre-training yields state-of-the-art results on 12 downstream summarization datasets (XSum, CNN/DM, NEWSROOM, Multi-News, Gigaword, arXiv, PubMed, BIGPATENT, WikiHow, Reddit TIFU, AESLC, BillSum).
Selecting important sentences independently (Ind-Orig) for gap generation consistently outperforms the random and lead-based strategies, and a gap-sentence ratio (GSR) of around 30% is effective across datasets.
PEGASUS-LARGE (pre-trained on HugeNews) achieves higher ROUGE scores than prior SOTA on most datasets, with a notable gain on XSum and only a marginal one on CNN/DM; WikiHow favors C4 pre-training.
In low-resource settings, PEGASUS-LARGE surpasses previous state-of-the-art results on 6 datasets after fine-tuning on only 1000 examples; zero-shot performance is also strong.
Human evaluation indicates that PEGASUS outputs are often on par with or better than human reference summaries on XSum, CNN/DM, and Reddit TIFU under multiple conditions.
A mixture of C4 and HugeNews with stochastic gap-sentence selection provides further improvements across many datasets (Table 4).

This review was generated by AI and reviewed by human editors.