QUICK REVIEW

[Paper Review] Generating Wikipedia by Summarizing Long Sequences

Peter J. Liu, Mohammad Saleh|arXiv (Cornell University)|Jan 30, 2018

Natural Language Processing Techniques|References: 14|Citations: 74

One-Sentence Summary

The paper treats Wikipedia article generation as a multi-document abstractive summarization task, introducing a decoder-only Transformer variant capable of handling very long input sequences to generate coherent Wikipedia text.

ABSTRACT

We show that generating English Wikipedia articles can be approached as a multi-document summarization of source documents. We use extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article. For the abstractive model, we introduce a decoder-only architecture that can scalably attend to very long sequences, much longer than typical encoder-decoder architectures used in sequence transduction. We show that this model can generate fluent, coherent multi-sentence paragraphs and even whole Wikipedia articles. When given reference documents, we show it can extract relevant factual information as reflected in perplexity, ROUGE scores and human evaluations.

Motivation and Objectives

Frame Wikipedia article generation as multi-document summarization over diverse reference texts.
Propose a two-stage extractive-abstractive framework to manage very long inputs.
Develop and evaluate decoder-only Transformer architectures that handle long sequences.
Demonstrate that abstractive models can produce fluent, cohesive Wikipedia-style text given reference documents.

Proposed Method

Define the WikiSum dataset, which pairs cited sources and web-search results (the reference inputs) with Wikipedia text (the targets).
Use an extractive stage to select salient input text, comparing tf-idf, TextRank, SumBasic, and a "cheating" extractor that has access to the ground-truth article.
Train an abstractive stage that conditions on very long inputs (up to 11,000 tokens) to generate multi-sentence Wikipedia leads.
Propose a decoder-only Transformer variant (T-D) and enhancements (T-DMCA) with local and memory-compressed attention for long sequences.
Incorporate a memory-efficient architecture with optional mixture-of-experts (MoE) layers to scale capacity.
Evaluate using perplexity and ROUGE-L F1, supplemented by human linguistic quality judgments.
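
The extractive stage listed above can be illustrated with a minimal tf-idf sketch. It assumes the article title serves as the query, that each paragraph is treated as a document, and that the ranked paragraphs are concatenated and truncated to the first L tokens; the review does not spell out these details. The names `tokenize`, `rank_paragraphs`, and `extract` are hypothetical, and the whitespace tokenizer is a stand-in for whatever tokenization the authors used.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def rank_paragraphs(title, paragraphs):
    """Return paragraphs sorted by tf-idf relevance to the title, best first."""
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n_docs = len(docs)
    query = set(tokenize(title))
    scores = []
    for doc in docs:
        score = 0.0
        for word in query:
            # Document frequency: number of paragraphs containing the word.
            df = sum(1 for d in docs if word in d)
            if df:
                score += doc[word] * math.log(n_docs / df)
        scores.append(score)
    # sorted() is stable, so ties keep their original order.
    order = sorted(range(n_docs), key=lambda i: -scores[i])
    return [paragraphs[i] for i in order]


def extract(title, paragraphs, max_tokens):
    """Concatenate ranked paragraphs and keep the first max_tokens tokens."""
    tokens = []
    for paragraph in rank_paragraphs(title, paragraphs):
        tokens.extend(tokenize(paragraph))
    return tokens[:max_tokens]
```

Calling `extract(title, paragraphs, max_tokens=500)` would produce the kind of truncated input that an L=500 model sees; raising `max_tokens` corresponds to the longer input lengths reported in the results.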

Experimental Results

Research Questions

RQ1: Can long multi-document inputs be effectively summarized into Wikipedia-like text using abstractive models?
RQ2: Does a decoder-only Transformer outperform encoder-decoder setups on long-sequence summarization tasks?
RQ3: How does input extraction quality impact final abstractive performance in multi-document summarization for Wikipedia leads?
RQ4: What architectural adaptations (local and memory-compressed attention, MoE) enable processing of very long sequences?
RQ5: Can the approach generate fluent leads and full articles conditioned on reference documents?

Key Findings

Model	Test perplexity	ROUGE-L
seq2seq-attention, L=500	5.04952	12.7
Transformer-ED, L=500	2.46645	34.2
Transformer-D, L=4000	2.22216	33.6
Transformer-DMCA, no MoE-layer, L=11000	2.05159	36.2
Transformer-DMCA, MoE-128, L=11000	1.92871	37.9
Transformer-DMCA, MoE-256, L=7500	1.90325	38.8
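
For readers unfamiliar with the ROUGE-L column, the following is a minimal sketch of an F1 score based on the longest common subsequence (LCS) between a generated text and a reference. It is a simplified, single-sequence version that assumes whitespace tokenization; official ROUGE tooling handles multi-sentence summaries and preprocessing differently, so it would not reproduce the table's numbers exactly. The sketch returns a value between 0 and 1, whereas the table values appear to be on a 0 to 100 scale. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 over whitespace tokens, in the range [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)
```

Because the LCS only requires tokens to appear in the same order, not contiguously, the score rewards outputs that preserve the reference's word order even when extra words are interleaved.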

A two-stage extractive-abstractive framework yields fluent, coherent Wikipedia leads conditioned on multi-document references.
Smart extraction (tf-idf) significantly boosts abstractive performance over naive extraction baselines.
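Decoder-only variants beat seq2seq-attention on both metrics. Against Transformer-ED (L=500), T-D (L=4000) lowers perplexity (2.22 vs. 2.47) but scores slightly lower on ROUGE-L (33.6 vs. 34.2), while T-DMCA improves both, reaching a perplexity of 1.90 and a ROUGE-L of 38.8 on combined data.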
Memory-efficient attention (local and memory-compressed) enables processing sequences up to 11,000 tokens, increasing modeling capacity and performance.
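Mixture-of-experts (MoE) layers further improve both metrics for T-DMCA: perplexity falls from 2.05 (no MoE) to 1.93 (MoE-128) and 1.90 (MoE-256), and ROUGE-L rises from 36.2 to 37.9 and 38.8, although the MoE-256 model was run at L=7500 rather than L=11000.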
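
To give a rough sense of why local and memory-compressed attention make 11,000-token inputs tractable, the sketch below counts how many query-key scores each scheme computes. It is back-of-the-envelope arithmetic, not the authors' implementation: it ignores causal masking, multiple heads, and the cost of the compression step itself. The block size of 256 and the compression stride of 3 used in the example are illustrative assumptions, since the review does not state the paper's settings, and the function names are hypothetical.

```python
import math


def full_attention_scores(n):
    """Query-key pairs scored by full self-attention over n tokens."""
    return n * n


def local_attention_scores(n, block):
    """Each token attends only within its own block of `block` tokens."""
    full_blocks, rest = divmod(n, block)
    return full_blocks * block * block + rest * rest


def compressed_attention_scores(n, stride):
    """Every query attends to keys/values shortened by a factor of `stride`."""
    return n * math.ceil(n / stride)


n = 11000
print(full_attention_scores(n))           # 121000000
print(local_attention_scores(n, 256))     # 2814016
print(compressed_attention_scores(n, 3))  # 40337000
```

Under these assumptions, local attention grows linearly with sequence length for a fixed block size, while memory-compressed attention stays quadratic but reduces the constant by the stride, which is consistent with combining the two kinds of layers so that some layers still see the whole sequence.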


This review was generated by AI and checked by a human editor.