[Paper Review] Generating Wikipedia by Summarizing Long Sequences
The paper treats Wikipedia article generation as a multi-document abstractive summarization task, introducing a decoder-only Transformer variant capable of handling very long input sequences to generate coherent Wikipedia text.
We show that generating English Wikipedia articles can be approached as a multi-document summarization of source documents. We use extractive summarization to coarsely identify salient information and a neural abstractive model to generate the article. For the abstractive model, we introduce a decoder-only architecture that can scalably attend to very long sequences, much longer than typical encoder-decoder architectures used in sequence transduction. We show that this model can generate fluent, coherent multi-sentence paragraphs and even whole Wikipedia articles. When given reference documents, we show it can extract relevant factual information as reflected in perplexity, ROUGE scores and human evaluations.
Research Motivation and Objectives
- Frame Wikipedia article generation as multi-document summarization over diverse reference texts.
- Propose a two-stage extractive-abstractive framework to manage very long inputs.
- Develop and evaluate decoder-only Transformer architectures that handle long sequences.
- Demonstrate that abstractive models can produce fluent, cohesive Wikipedia-style text given reference documents.
Proposed Method
- Define a WikiSum dataset combining citations and web-search documents as reference inputs and Wikipedia text as targets.
- Use an extractive stage to select salient input text with methods including tf-idf, TextRank, SumBasic, and a cheating extractor.
- Train an abstractive stage that takes very long inputs (up to 11,000 tokens) and generates multi-sentence Wikipedia leads.
- Propose a decoder-only Transformer variant (T-D) and enhancements (T-DMCA) with local and memory-compressed attention for long sequences.
- Incorporate a memory-efficient architecture with optional mixture-of-experts (MoE) layers to scale capacity.
- Evaluate using perplexity and ROUGE-L F1, supplemented by human linguistic quality judgments.
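The extractive stage in the list above can be illustrated with a small tf-idf ranker. This is a minimal sketch under stated assumptions, not the authors' implementation: using the article title as the query, the whitespace tokenizer, and the helper names `tokenize`, `tfidf_rank`, and `extract` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for illustration)."""
    return text.lower().split()


def tfidf_rank(query, paragraphs):
    """Return paragraph indices sorted by descending tf-idf score for the query.

    Each paragraph is treated as one document. A query word contributes
    count_in_paragraph * log(num_paragraphs / paragraphs_containing_word).
    """
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n_docs = len(docs)
    query_words = set(tokenize(query))
    # Document frequency of each query word across all paragraphs.
    doc_freq = {w: sum(1 for d in docs if w in d) for w in query_words}
    scores = []
    for counts in docs:
        score = 0.0
        for word in query_words:
            if doc_freq[word]:
                score += counts[word] * math.log(n_docs / doc_freq[word])
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


def extract(query, paragraphs, max_tokens):
    """Concatenate top-ranked paragraphs and keep the first max_tokens tokens."""
    selected = []
    for i in tfidf_rank(query, paragraphs):
        selected.extend(tokenize(paragraphs[i]))
        if len(selected) >= max_tokens:
            break
    return selected[:max_tokens]
```

Here `max_tokens` plays the role of the input length L given to the abstractive model. The output of `extract` would then be passed to that model as its (truncated) input sequence.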
Experimental Results
Research Questions
- RQ1: Can very long multi-document inputs be effectively summarized into Wikipedia-like text using abstractive models?
- RQ2: Does a decoder-only Transformer outperform encoder-decoder setups on long-sequence summarization tasks?
- RQ3: How does input extraction quality impact final abstractive performance in multi-document summarization for Wikipedia leads?
- RQ4: What architectural adaptations (local and memory-compressed attention, MoE) enable processing of very long sequences?
- RQ5: Can the approach generate fluent leads and full articles conditioned on reference documents?
Key Findings
| Model | Test perplexity | ROUGE-L |
|---|---|---|
| seq2seq-attention, L=500 | 5.04952 | 12.7 |
| Transformer-ED, L=500 | 2.46645 | 34.2 |
| Transformer-D, L=4000 | 2.22216 | 33.6 |
| Transformer-DMCA, no MoE-layer, L=11000 | 2.05159 | 36.2 |
| Transformer-DMCA, MoE-128, L=11000 | 1.92871 | 37.9 |
| Transformer-DMCA, MoE-256, L=7500 | 1.90325 | 38.8 |
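For readers unfamiliar with the two metrics in the table above, the following sketch shows the standard textbook definitions of ROUGE-L F1 (based on the longest common subsequence) and perplexity. It is an illustration only: the whitespace tokenization and the function names are assumptions, and the paper's actual evaluation tooling and token-level averaging may differ. ROUGE scores in the table appear to be scaled by 100, whereas this sketch returns a value in [0, 1].

```python
import math


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per target token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

Lower perplexity means the model assigns higher probability to the reference text, while higher ROUGE-L means the generated text shares longer in-order word sequences with the reference.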
- A two-stage extractive-abstractive framework yields fluent, coherent Wikipedia leads conditioned on multi-document references.
- Smart extraction (tf-idf) significantly boosts abstractive performance over naive extraction baselines.
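- Decoder-only Transformer variants outperform seq2seq-attention on long inputs. Compared with the standard Transformer-ED, T-D improves perplexity (2.22 vs. 2.47) but not ROUGE-L (33.6 vs. 34.2), while T-DMCA improves both, reaching a perplexity of 1.90 and a ROUGE-L of 38.8 on the combined data.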
- Memory-efficient attention (local and memory-compressed) enables processing sequences up to 11,000 tokens, increasing modeling capacity and performance.
- Mixtures of experts (MoE) further improve perplexity and ROUGE when scaling the model to long inputs.
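To give a feel for why local and memory-compressed attention make inputs of around 11,000 tokens tractable, the sketch below counts query-key score entries per attention head for each variant. It is back-of-the-envelope arithmetic, not the authors' code: the block size of 256 and compression stride of 3 used in the usage note are illustrative assumptions, the function names are hypothetical, and causal masking (which roughly halves the useful entries) is ignored.

```python
import math


def full_attention_pairs(n):
    """Every position attends to every position: n * n score entries."""
    return n * n


def local_attention_pairs(n, block_size):
    """Positions attend only within their own block of block_size tokens."""
    full_blocks, remainder = divmod(n, block_size)
    return full_blocks * block_size ** 2 + remainder ** 2


def memory_compressed_pairs(n, stride):
    """Keys and values are shortened by a factor of `stride` before attention,
    so each of the n queries scores against ceil(n / stride) compressed keys."""
    return n * math.ceil(n / stride)
```

For example, `full_attention_pairs(11000)`, `local_attention_pairs(11000, 256)`, and `memory_compressed_pairs(11000, 3)` can be compared directly: local attention reduces the cost the most but cannot exchange information across blocks, while memory-compressed attention keeps a global view at a reduced cost.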
This review was generated by AI and checked by a human editor.