QUICK REVIEW

[Paper Review] Generating Wikipedia by Summarizing Long Sequences

Peter J. Liu, Mohammad Saleh | arXiv (Cornell University) | Jan 30, 2018
Natural Language Processing Techniques | References: 14 | Citations: 74
One-line summary

The paper treats Wikipedia article generation as a multi-document abstractive summarization task, introducing a decoder-only Transformer variant capable of handling very long input sequences to generate coherent Wikipedia text.

ABSTRACT

We show that generating English Wikipedia articles can be approached as a multi-document summarization of source documents. We use extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article. For the abstractive model, we introduce a decoder-only architecture that can scalably attend to very long sequences, much longer than typical encoder-decoder architectures used in sequence transduction. We show that this model can generate fluent, coherent multi-sentence paragraphs and even whole Wikipedia articles. When given reference documents, we show it can extract relevant factual information as reflected in perplexity, ROUGE scores and human evaluations.

Motivation and objectives

  • Frame Wikipedia article generation as multi-document summarization of diverse reference texts.
  • Propose a two-stage extractive-abstractive framework to manage very long inputs.
  • Develop and evaluate decoder-only Transformer architectures that handle long sequences.
  • Demonstrate that abstractive models can produce fluent, cohesive Wikipedia-style text given reference documents.

Proposed method

  • Define a WikiSum dataset combining citations and web-search documents as reference inputs and Wikipedia text as targets.
  • Use an extractive stage to select salient input text, comparing tf-idf, TextRank, SumBasic, and a "cheating" extractor that ranks paragraphs using the ground-truth article and so serves as an upper bound.
  • Train an abstractive stage that takes very long inputs (up to 11,000 tokens) and generates multi-sentence Wikipedia leads.
  • Propose a decoder-only Transformer variant (T-D) and enhancements (T-DMCA) with local and memory-compressed attention for long sequences.
  • Incorporate a memory-efficient architecture with optional mixture-of-experts (MoE) layers to scale capacity.
  • Evaluate using perplexity and ROUGE-L F1, supplemented by human linguistic quality judgments.
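
The extractive stage listed above can be illustrated with a minimal sketch. It assumes that the article title acts as the query, that paragraphs from the reference documents are ranked by a plain tf-idf score against that query, and that the ranked text is truncated to a token budget. The function names (`rank_paragraphs`, `extract`), the whitespace tokenizer, and the exact scoring formula are illustrative assumptions, not the authors' code.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for a real tokenizer.
    return text.lower().split()


def rank_paragraphs(title, paragraphs):
    """Sort paragraphs by a tf-idf score computed against the title words."""
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n_docs = len(docs)
    query = set(tokenize(title))
    scores = []
    for counts in docs:
        score = 0.0
        for word in query:
            # Document frequency: number of paragraphs containing the word.
            df = sum(1 for d in docs if word in d)
            if df:
                score += counts[word] * math.log(n_docs / df)
        scores.append(score)
    # sorted() is stable, so ties keep their original paragraph order.
    order = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return [paragraphs[i] for i in order]


def extract(title, paragraphs, max_tokens):
    """Concatenate ranked paragraphs and keep the first max_tokens tokens."""
    tokens = []
    for paragraph in rank_paragraphs(title, paragraphs):
        tokens.extend(tokenize(paragraph))
    return tokens[:max_tokens]
```

Calling `extract(title, paragraphs, max_tokens=11000)` would mirror the largest input length mentioned above; the resulting token list is what an abstractive model would then condition on.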

Experimental results

Research questions

  • RQ1: Can long multi-document inputs be effectively summarized into Wikipedia-like text using abstractive models?
  • RQ2: Does a decoder-only Transformer outperform encoder-decoder setups on long-sequence summarization tasks?
  • RQ3: How does input extraction quality impact final abstractive performance in multi-document summarization for Wikipedia leads?
  • RQ4: What architectural adaptations (local and memory-compressed attention, MoE) enable processing of very long sequences?
  • RQ5: Can the approach generate fluent leads and full articles conditioned on reference documents?

Key findings

Model                                       Test perplexity   ROUGE-L
seq2seq-attention, L=500                    5.04952           12.7
Transformer-ED, L=500                       2.46645           34.2
Transformer-D, L=4000                       2.22216           33.6
Transformer-DMCA, no MoE-layer, L=11000     2.05159           36.2
Transformer-DMCA, MoE-128, L=11000          1.92871           37.9
Transformer-DMCA, MoE-256, L=7500           1.90325           38.8

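
For readers unfamiliar with the two automatic metrics in the table, the following is a minimal sketch of how each is commonly computed. It assumes whitespace tokenization and a single reference, and it returns ROUGE-L F1 on a 0 to 1 scale, whereas the table reports it as a percentage. Standard ROUGE tooling applies additional preprocessing, so this is an illustration rather than a reproduction of the paper's evaluation.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A lower perplexity means the model assigns higher probability to the reference text, while a higher ROUGE-L means the generated text shares a longer in-order word subsequence with the reference.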
  • A two-stage extractive-abstractive framework yields fluent, coherent Wikipedia leads conditioned on multi-document references.
  • Smart extraction (tf-idf) significantly boosts abstractive performance over naive extraction baselines.
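  • Decoder-only Transformer variants scale to much longer inputs than seq2seq-attention and Transformer-ED (both at L=500). T-D at L=4000 lowers perplexity relative to Transformer-ED (2.22 vs. 2.47) but scores slightly lower on ROUGE-L (33.6 vs. 34.2); T-DMCA improves both, reaching perplexity as low as 1.90 and ROUGE-L up to 38.8 on combined data.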
  • Memory-efficient attention (local and memory-compressed) enables processing sequences up to 11,000 tokens, increasing modeling capacity and performance.
  • Mixtures of experts (MoE) further improve perplexity and ROUGE when scaling the model to long inputs.
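
The idea behind memory-compressed attention noted above can be sketched in a few lines: keys and values are shortened along the sequence axis before ordinary dot-product attention, so each query attends over fewer positions. This sketch swaps in mean pooling for a learned compression step, uses a single head, and omits causal masking and local attention; the function names and the default `stride=3` are illustrative assumptions.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def compress(rows, stride):
    """Mean-pool each group of `stride` consecutive rows (assumed stand-in
    for a learned compression of the keys and values)."""
    out = []
    for start in range(0, len(rows), stride):
        group = rows[start:start + stride]
        dim = len(group[0])
        out.append([sum(r[d] for r in group) / len(group) for d in range(dim)])
    return out


def attention(queries, keys, values):
    """Plain scaled dot-product attention, one head, no masking."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def memory_compressed_attention(queries, keys, values, stride=3):
    """Attend over compressed keys/values: roughly n * n / stride score
    computations instead of n * n for a length-n sequence."""
    return attention(queries, compress(keys, stride), compress(values, stride))
```

With a stride of 3, a sequence of 6 keys is reduced to 2 compressed keys, so the score matrix shrinks by the same factor; this reduction is what makes much longer inputs affordable.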

This review was generated by AI and checked by a human editor.