QUICK REVIEW

[Paper Review] Rethinking Attention with Performers

Krzysztof Choromański, Valerii Likhosherstov|arXiv (Cornell University)|Sep 30, 2020

Domain Adaptation and Few-Shot Learning | References: 55 | Cited by: 122

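One-Sentence Summary

Performer introduces FAVOR+, which approximates softmax attention in linear space and time, making large Transformer-like models feasible without sparsity priors, with provable accuracy and full compatibility with standard Transformers.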
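
To make the linear-complexity claim concrete, the following worked equations sketch where the savings come from. The symbols are assumptions chosen for this illustration and do not appear in the review itself: L is the sequence length, d the per-head dimension, m the number of random features, and Q', K' are the L-by-m matrices obtained by applying the random feature map row by row to the queries and keys.

```latex
\begin{align*}
\mathrm{Att}(Q, K, V) &= D^{-1} A V, &
A &= \exp\!\left( Q K^{\top} / \sqrt{d} \right), &
D &= \mathrm{diag}\!\left( A \mathbf{1}_L \right), \\
\widehat{\mathrm{Att}}(Q, K, V) &= \widehat{D}^{-1} \left( Q' \left( (K')^{\top} V \right) \right), &
\widehat{A} &= Q' (K')^{\top}, &
\widehat{D} &= \mathrm{diag}\!\left( Q' \left( (K')^{\top} \mathbf{1}_L \right) \right).
\end{align*}
```

Exact attention materializes the L-by-L matrix A, which costs O(L^2 d) time and O(L^2 + Ld) space. In the approximate form the brackets are evaluated right to left, so the L-by-L matrix is never built and the cost drops to O(Lmd) time and O(Lm + Ld + md) space, which is linear in L for fixed m and d.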

ABSTRACT

We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.

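Research Motivation and Objectives

Explain why attention mechanisms need to scale without relying on sparsity or low-rank priors.
Introduce Performers as a Transformer variant that approximates full-rank softmax attention with linear complexity.
Develop and formalize the FAVOR+ mechanism for unbiased kernel-based attention estimation.
Provide theoretical guarantees for the attention approximation (unbiasedness, uniform convergence, low variance).
Demonstrate empirical effectiveness on vision, language, and biological (protein) sequence modeling tasks.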

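Proposed Method

Cast attention in kernelized form and approximate it using positive random features (PRFs) and orthogonal random features (ORFs).
Specify FAVOR+, which uses positive random features to approximate the softmax kernel and thereby computes attention in linear space and time.
Prove unbiased or nearly unbiased estimation of the attention matrix, with uniform convergence and reduced variance.
Show that a regularized softmax kernel approximates softmax well, which makes practical training possible.
Provide pseudocode and discuss implementation details for integration with standard Transformers.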
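
The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it covers only non-causal (bidirectional) attention, the function names `orthogonal_gaussian`, `positive_features`, and `favor_attention` are hypothetical, the sizes are arbitrary, and numerical-stability tricks that a practical implementation would need are left out.

```python
import numpy as np

def orthogonal_gaussian(m, d, rng):
    """Draw m projection vectors in R^d: orthogonal within each block of d
    rows, with row norms resampled to match those of Gaussian vectors."""
    blocks = []
    for _ in range(-(-m // d)):  # ceil(m / d) independent blocks
        q_factor, _ = np.linalg.qr(rng.standard_normal((d, d)))
        blocks.append(q_factor.T)  # rows are orthonormal
    directions = np.concatenate(blocks)[:m]
    norms = np.linalg.norm(rng.standard_normal((m, d)), axis=1)
    return directions * norms[:, None]

def positive_features(x, w):
    """phi(x) = exp(w.x - |x|^2 / 2) / sqrt(m), so that
    E[phi(x) . phi(y)] = exp(x . y) and every feature is positive."""
    m = w.shape[0]
    sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(x @ w.T - sq_norm) / np.sqrt(m)

def favor_attention(q, k, v, w):
    """Approximate softmax attention without forming the L x L matrix."""
    d = q.shape[-1]
    qp = positive_features(q / d ** 0.25, w)  # (L, m)
    kp = positive_features(k / d ** 0.25, w)  # (L, m)
    kv = kp.T @ v                             # (m, d_v), linear in L
    normalizer = qp @ kp.sum(axis=0)          # (L,), approximate row sums
    return (qp @ kv) / normalizer[:, None]

def softmax_attention(q, k, v):
    """Exact quadratic-cost softmax attention, used as the reference."""
    d = q.shape[-1]
    scores = np.exp(q @ k.T / np.sqrt(d))
    return (scores / scores.sum(axis=1, keepdims=True)) @ v

rng = np.random.default_rng(0)
L, d, m = 512, 16, 256  # illustrative sizes, not taken from the paper
q = 0.3 * rng.standard_normal((L, d))
k = 0.3 * rng.standard_normal((L, d))
v = rng.standard_normal((L, d))
w = orthogonal_gaussian(m, d, rng)

approx = favor_attention(q, k, v, w)
exact = softmax_attention(q, k, v)
rel_error = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print("relative Frobenius error:", rel_error)
```

The key design choice is the order of multiplication in `favor_attention`: computing `kp.T @ v` first keeps every intermediate at size L-by-m or m-by-d. Because all features are positive, the normalizer is a sum of positive terms and the division cannot hit zero or flip sign.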

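Experimental Results

Research Questions

RQ1: Can softmax attention be approximated accurately in linear space and time without priors such as sparsity or low rank?
RQ2: How effective are positive orthogonal random features (FAVOR+) at approximating softmax attention across different tasks?
RQ3: Does the Performer approximation come with theoretical guarantees (unbiasedness, uniform convergence, low variance)?
RQ4: How does FAVOR+ perform empirically against other efficient attention methods on long-sequence and protein modeling tasks?
RQ5: Can FAVOR+ be applied to kernelizable attention mechanisms other than softmax?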

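Key Findings

Performers achieve results competitive with other efficient attention methods while keeping linear complexity.
FAVOR+ provides unbiased or nearly unbiased estimation of regular softmax attention, with uniform convergence and low estimation variance.
Orthogonal and positive random features reduce mean squared error, enabling accurate attention approximation with a practical number of features.
Empirical results show favorable speed/memory trade-offs, and Performers are compatible with pretrained Transformer weights after fine-tuning.
The method scales to long sequences (large sequence length L) and to protein sequence modeling, matching or approaching Transformer performance with linear resources.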
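
The finding about positive random features and mean squared error can be checked with a small numerical experiment. The sketch below rests on assumptions: the trigonometric baseline is not described in this review (it is the classic random Fourier feature estimator, included here only for comparison), the function names are hypothetical, and the sizes are arbitrary. It estimates a softmax kernel value close to zero, the regime where small absolute errors matter most once attention rows are normalized.

```python
import numpy as np

def trig_estimate(x, y, w):
    """Random Fourier feature estimate of exp(x . y); can come out negative."""
    scale = np.exp(0.5 * (x @ x + y @ y))
    return scale * np.mean(np.cos(w @ (x - y)))

def positive_estimate(x, y, w):
    """Positive random feature estimate of exp(x . y); always > 0."""
    return np.mean(np.exp(w @ (x + y) - 0.5 * (x @ x + y @ y)))

rng = np.random.default_rng(0)
d, m, trials = 16, 256, 200  # illustrative sizes
x = 0.5 * rng.standard_normal(d)
y = -x + 0.1 * rng.standard_normal(d)  # nearly opposite, so exp(x . y) is small
true_value = np.exp(x @ y)

# Each trial draws a fresh set of m i.i.d. Gaussian projection vectors.
trig = np.array([trig_estimate(x, y, rng.standard_normal((m, d)))
                 for _ in range(trials)])
pos = np.array([positive_estimate(x, y, rng.standard_normal((m, d)))
                for _ in range(trials)])

trig_mse = np.mean((trig - true_value) ** 2)
pos_mse = np.mean((pos - true_value) ** 2)
print(f"true={true_value:.4g}  trig_mse={trig_mse:.4g}  pos_mse={pos_mse:.4g}")
```

Both estimators are unbiased for the same kernel. The difference lies in variance: the trigonometric estimate multiplies a large prefactor by an average of cosines that should nearly cancel, while the positive estimate averages terms that are already of the right magnitude and sign.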

This review was generated by AI and reviewed by human editors.