QUICK REVIEW

[Paper Review] Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E. Peters|arXiv (Cornell University)|Apr 10, 2020

Topic Modeling | References: 54 | Cited by: 2,191

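One-Sentence Summary

Longformer introduces a sparse attention mechanism that scales linearly with sequence length (local windowed attention plus global tokens) to process long documents, is pretrained and finetuned on document-level NLP tasks, and comes with an encoder-decoder variant (LED) for long-document sequence-to-sequence tasks such as summarization.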

ABSTRACT

Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.

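Motivation and Objectives

Address the quadratic compute and memory bottleneck of standard self-attention on long sequences.
Propose a drop-in replacement attention pattern that combines local windowed attention with global attention to enable long-context modeling.
Show the gains from pretraining and finetuning on document-level NLP tasks relative to RoBERTa baselines.
Introduce the Longformer-Encoder-Decoder (LED) for long-document sequence-to-sequence tasks such as summarization.
Demonstrate state-of-the-art or strong results on long-document benchmarks (WikiHop, TriviaQA, arXiv summarization).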
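
The quadratic-versus-linear contrast can be made concrete by counting how many query-key pairs each pattern scores. The following is a rough sketch rather than the paper's own accounting: the function names are hypothetical, the window size of 512 and the single global token are illustrative values, and the windowed count is an upper bound that ignores truncation at the sequence edges.

```python
def full_attention_pairs(n: int) -> int:
    """Query-key pairs scored by standard self-attention: every token sees every token."""
    return n * n


def windowed_attention_pairs(n: int, window: int, num_global: int = 0) -> int:
    """Upper bound on query-key pairs for a sliding window plus global tokens.

    Each token attends to at most `window` positions around it, and each
    global token attends to, and is attended by, all n tokens.
    """
    local = n * min(window, n)
    global_pairs = 2 * num_global * n
    return local + global_pairs


# Illustrative sequence lengths; window=512 and num_global=1 are assumptions.
for n in (512, 4096, 16384):
    full = full_attention_pairs(n)
    sparse = windowed_attention_pairs(n, window=512, num_global=1)
    print(f"n={n:>6}  full={full:>12,}  windowed={sparse:>12,}  ratio={full / sparse:.1f}x")
```

With a fixed window, doubling the sequence length doubles the windowed count but quadruples the full-attention count, which is why the gap widens as documents grow.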

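Proposed Method

Define Longformer attention as the combination of sliding-window (local) attention and task-motivated global attention.
Implement three execution strategies (Longformer-loop, Longformer-chunks, Longformer-cuda) whose memory usage grows linearly with sequence length.
Continue pretraining Longformer with MLM starting from RoBERTa weights, extending the position embeddings to support longer sequences.
Finetune Longformer on document-level tasks (QA, coreference resolution, classification) within a RoBERTa-style framework, placing global attention on task-relevant tokens.
Develop LED for long-document summarization by applying Longformer-style attention to an encoder-decoder architecture.
Run ablations over window size, dilation, and global attention to validate the design choices.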
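
The attention pattern described above (a sliding window, optionally dilated, plus a few global tokens) can be visualized as a boolean mask. This is a minimal illustrative sketch, not any of the three implementations named in the list: the function name `longformer_style_mask` and its parameters are hypothetical, and the dense mask it builds still takes quadratic memory, so it shows only which positions may attend to which.

```python
import numpy as np


def longformer_style_mask(seq_len, window, global_idx=(), dilation=1):
    """Return a boolean matrix where mask[i, j] is True if query i may attend to key j."""
    half = window // 2
    pos = np.arange(seq_len)
    offset = pos[None, :] - pos[:, None]  # offset[i, j] = j - i
    # Local band: keys up to `half` dilated steps on either side of the query.
    local = (np.abs(offset) <= half * dilation) & (offset % dilation == 0)
    mask = local.copy()
    # Global tokens attend to every position and are attended to by every position.
    g = np.asarray(list(global_idx), dtype=int)
    mask[g, :] = True
    mask[:, g] = True
    return mask


# Toy example: 8 tokens, one neighbour on each side, token 0 (e.g. a [CLS]-like token) global.
mask = longformer_style_mask(seq_len=8, window=2, global_idx=[0])
print(mask.astype(int))
```

Setting `dilation` above 1 leaves gaps inside the window, so the same number of attended positions covers a wider span of the document.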

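Experimental Results

Research Questions

RQ1: Can a sparse attention pattern that scales linearly with sequence length (local window plus global tokens) match or exceed the performance of full self-attention on long documents?
RQ2: Does pretraining Longformer and finetuning it on document-level tasks improve over RoBERTa-based baselines on classification, QA, and coreference tasks?
RQ3: Can Longformer support long-sequence seq2seq tasks through its encoder-decoder variant for summarization (LED)?
RQ4: How do window size, dilation, and global attention affect performance on long-context benchmarks?
RQ5: Under a similar pretrain-finetune regime, how does Longformer compare with contemporaneous long-document models (e.g., Transformer-XL, Reformer, Sparse Transformer)?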

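Key Findings

Longformer consistently outperforms RoBERTa baselines across a range of long-document tasks (QA, coreference, classification).
Longer context tends to bring larger gains on long-context QA and document-level datasets (WikiHop, Hyperpartisan) and smaller gains on short-context tasks.
Longformer-large achieves state-of-the-art results on WikiHop and TriviaQA in the long-context setting and is competitive on HotpotQA.
Continuing RoBERTa's MLM pretraining with position embeddings extended to 4,096 positions enables effective long-document modeling, and initializing the new position embeddings by copying leads to fast convergence.
LED demonstrates that applying Longformer-style attention to an encoder-decoder architecture is effective for long-document summarization (arXiv dataset).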
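
The copy-initialized position embeddings mentioned in the findings can be sketched as follows, assuming that "copy initialization" means repeating the original block of learned embeddings until the longer length is filled. The function name is hypothetical, and the 512 x 768 matrix is a random stand-in with illustrative sizes, not actual pretrained weights.

```python
import numpy as np


def extend_position_embeddings(pos_emb, new_len):
    """Tile a learned (old_len, dim) position-embedding matrix to new_len rows."""
    old_len, _ = pos_emb.shape
    repeats = -(-new_len // old_len)  # ceiling division
    return np.tile(pos_emb, (repeats, 1))[:new_len].copy()


rng = np.random.default_rng(0)
short = rng.normal(size=(512, 768))  # random stand-in for pretrained embeddings (assumed sizes)
long = extend_position_embeddings(short, 4096)
print(long.shape)
```

Under this assumption, every 512-position block of the extended matrix starts out identical to the original, so local positional structure learned during the original pretraining is available from the first update onward.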

This review was generated by AI and reviewed by human editors.