QUICK REVIEW

[Paper Explained] Unlimiformer: Long-Range Transformers with Unlimited Length Input

Amanda Bertsch, Uri Alon|arXiv (Cornell University)|May 2, 2023

Handwritten Text Recognition Techniques | Cited by 23

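One-Sentence Summary

Unlimiformer adds a single k-nearest-neighbor index to the cross-attention of pretrained encoder-decoder transformers, which removes the limit on input length at test time without additional training and improves long-document and book summarization.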

ABSTRACT

Since the proposal of transformers, these models have been limited to bounded input lengths, because of their need to attend to every token in the input. In this work, we propose Unlimiformer: a general approach that wraps any existing pretrained encoder-decoder transformer, and offloads the cross-attention computation to a single k-nearest-neighbor (kNN) index, while the returned kNN distances are the attention dot-product scores. This kNN index can be kept on either the GPU or CPU memory and queried in sub-linear time; this way, we can index practically unlimited input sequences, while every attention head in every decoder layer retrieves its top-k keys, instead of attending to every key. We evaluate Unlimiformer on several long-document and book-summarization benchmarks, showing that it can process even 500k token-long inputs from the BookSum dataset, without any input truncation at test time. We demonstrate that Unlimiformer improves pretrained models such as BART and Longformer by extending them to unlimited inputs without additional learned weights and without modifying their code. We make our code and models publicly available at https://github.com/abertsch72/unlimiformer .
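The core idea in the abstract, attending only to the top-k keys returned by an index whose distances double as attention scores, can be illustrated with a small sketch. This is not the authors' implementation: the function name `topk_cross_attention`, the single-head setup, and the brute-force exact search that stands in for a real kNN index are all assumptions made for illustration.

```python
import numpy as np

def topk_cross_attention(h_d, H_e, W_q, W_k, W_v, k):
    """Single-head cross-attention restricted to the k best-scoring encoder states.

    h_d: (d,) decoder hidden state.
    H_e: (n, d) encoder hidden states, playing the role of the index contents.
    W_q, W_k, W_v: (d, d_head) projection matrices for this head.
    Returns the attention output (d_head,) and the indices of the selected keys.
    """
    # Fold the key projection into the query so the index can hold raw
    # encoder states: (h_d W_q)(h_e W_k)^T == (h_d W_q W_k^T) h_e^T.
    query = h_d @ W_q @ W_k.T              # shape (d,)
    scores = H_e @ query                   # shape (n,), unscaled attention logits
    # Exact top-k by brute force (assumption: a real kNN index would
    # approximate this step without scoring every key).
    top = np.argpartition(scores, -k)[-k:]
    logits = scores[top] / np.sqrt(W_q.shape[1])
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    values = H_e[top] @ W_v                # shape (k, d_head)
    return weights @ values, top

# Hypothetical sizes, chosen only to exercise the function.
rng = np.random.default_rng(0)
n, d, d_head = 1000, 16, 8
H_e = rng.normal(size=(n, d))
h_d = rng.normal(size=d)
W_q, W_k, W_v = (rng.normal(size=(d, d_head)) for _ in range(3))
output, selected = topk_cross_attention(h_d, H_e, W_q, W_k, W_v, k=32)
```

With `k` equal to the number of encoder states, the sketch reduces to ordinary full cross-attention; smaller `k` drops the keys with the lowest scores, which contribute the least softmax mass.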

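Motivation and Goals

Motivate and enable processing of extremely long inputs that exceed the standard context window, without retraining from scratch.
Propose a general, non-parametric retrieval mechanism that replaces full cross-attention over all input tokens.
Show that a single k-NN index is enough to approximate attention quality across all decoder layers and heads.
Demonstrate improvements on long-document and book-summarization benchmarks across several base models and training regimes.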

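Proposed Method

Insert a k-NN retrieval step before cross-attention in every decoder layer, selecting the top-k keys for each head.
Encode long inputs in overlapping chunks and index only the middle half of each chunk's hidden states.
Rewrite the attention computation so that each head queries with the projection h_d W_q W_k^T (h_d being the decoder hidden state); the index then stores only the encoder hidden states, and a single index can be shared across all layers and heads.
At decoding time, query the index and attend only to the retrieved top-k keys, using the dot-product distances as attention scores.
Store hidden states as 16-bit floats to bound memory (e.g., 2 GB for 1,000,000 tokens), and keep the index in CPU or GPU memory as needed.
Provide low-cost test-time variants (+test Unlimiformer, +early stop w/ Unlimiformer) as well as longer-range training methods (Random-encoded, Retrieval, Alternating).
Release code and models compatible with LLaMA-2 and HuggingFace Transformers.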
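The chunking step and the memory figure above can be made concrete with a short sketch. The helper names `middle_half_spans` and `index_size_bytes` are hypothetical, and two details are assumptions rather than facts from this summary: the first and last chunks also keep their outer edges so that every token is indexed exactly once, and the 2 GB figure is reproduced with a hidden size of 1024, which the summary does not state.

```python
def middle_half_spans(n_tokens, window):
    """Return (chunk_start, chunk_end, keep_start, keep_end) tuples.

    Each chunk covers token positions [chunk_start, chunk_end); only the
    encoder states for [keep_start, keep_end) would be added to the index.
    """
    if window % 4 != 0:
        raise ValueError("window must be divisible by 4 in this sketch")
    stride, margin = window // 2, window // 4
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        is_first, is_last = start == 0, end == n_tokens
        # Assumption: edge chunks keep their outer quarter so no token is lost.
        keep_start = start if is_first else start + margin
        keep_end = end if is_last else end - margin
        spans.append((start, end, keep_start, keep_end))
        if is_last:
            break
        start += stride
    return spans


def index_size_bytes(n_tokens, hidden_dim, bytes_per_value=2):
    """Size of an index holding one hidden_dim vector per token (16-bit by default)."""
    return n_tokens * hidden_dim * bytes_per_value


# 20 tokens with an 8-token window: chunks overlap by half, kept spans tile the input.
for span in middle_half_spans(20, 8):
    print(span)

# Assumed hidden size of 1024: 1,000,000 * 1024 * 2 bytes, roughly 2 GB.
print(index_size_bytes(1_000_000, 1024))
```

Keeping only the middle of each chunk means every indexed state was encoded with context on both sides, at the cost of encoding each token roughly twice.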

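Experimental Results

Research Questions

RQ1: Can cross-attention in encoder-decoder transformers be offloaded to a k-NN index to support unlimited-length inputs at test time?
RQ2: Is a single k-NN index shared across all decoder layers and heads sufficient for effective retrieval while preserving most of the attention quality?
RQ3: Can existing pretrained models be extended to unlimited-length inputs without adding learned parameters?
RQ4: What is the trade-off between accuracy and computational cost when using Unlimiformer on long-range summarization and related tasks?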

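Key Findings

Unlimiformer improves baseline models on long-document summarization without additional training (e.g., BART_base +test Unlimiformer scores higher on ROUGE/L and BERTScore than standard fine-tuning).
Early stopping with Unlimiformer yields notable gains at no additional training cost (e.g., on GovReport, ROUGE-1 rises from 48.7 to 51.0).
When trained with Unlimiformer, models such as PRIMERA match or outperform larger long-range baselines, and Unlimiformer can improve them further (e.g., PRIMERA +test Unlimiformer outperforms standard PRIMERA on ROUGE/L and EntMent).
On BookSum, Unlimiformer brings EntMent gains (e.g., Unlimiformer+PRIMERA raises EntMent from 11.6 for baseline PRIMERA to 25.5).
The retrieval-focused training variants (Retrieval, Random-encoded, Alternating) achieve competitive gains across datasets, with the best method varying by model and dataset.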

This explainer was generated by AI and reviewed by a human editor.