QUICK REVIEW

[Paper Review] Flowformer: Linearizing Transformers with Conservation Flows

Haixu Wu, Jialong Wu|arXiv (Cornell University)|Feb 13, 2022

Neural Networks and Reservoir Computing | Cited by 31

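One-Sentence Summary

Flowformer introduces Flow-Attention, a mechanism based on flow conservation that linearizes Transformer attention, achieving linear time complexity and competitive performance on long-sequence, language, vision, time-series, and reinforcement learning tasks.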

ABSTRACT

Transformers based on the attention mechanism have achieved impressive success in various areas. However, the attention mechanism has a quadratic complexity, significantly impeding Transformers from dealing with numerous tokens and scaling up to bigger models. Previous methods mainly utilize the similarity decomposition and the associativity of matrix multiplication to devise linear-time attention mechanisms. They avoid degeneration of attention to a trivial distribution by reintroducing inductive biases such as the locality, thereby at the expense of model generality and expressiveness. In this paper, we linearize Transformers free from specific inductive biases based on the flow network theory. We cast attention as the information flow aggregated from the sources (values) to the sinks (results) through the learned flow capacities (attentions). Within this framework, we apply the property of flow conservation into attention and propose the Flow-Attention mechanism of linear complexity. By respectively conserving the incoming flow of sinks for source competition and the outgoing flow of sources for sink allocation, Flow-Attention inherently generates informative attentions without using specific inductive biases. Empowered by the Flow-Attention, Flowformer yields strong performance in linear time for wide areas, including long sequence, time series, vision, natural language, and reinforcement learning. The code and settings are available at this repository: https://github.com/thuml/Flowformer.
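The "associativity of matrix multiplication" that the abstract attributes to earlier linear attention methods can be illustrated with a minimal NumPy sketch. It is not taken from the Flowformer repository: the function names are hypothetical, and the sigmoid feature map `phi` is an assumption chosen only because it is non-negative. Both functions compute the same normalized attention, but the second never materializes the n × m similarity matrix.

```python
import numpy as np

def phi(x):
    # Non-negative feature map; sigmoid is an assumption made for illustration.
    return 1.0 / (1.0 + np.exp(-x))

def quadratic_attention(q, k, v):
    # Builds the full (n, m) similarity matrix, so cost grows with n * m.
    scores = phi(q) @ phi(k).T
    return (scores / scores.sum(axis=1, keepdims=True)) @ v

def linear_attention(q, k, v):
    # Regroups (phi(Q) phi(K)^T) V as phi(Q) (phi(K)^T V): no (n, m) matrix is formed.
    kv = phi(k).T @ v                          # shape (d, d_v)
    normalizer = phi(q) @ phi(k).sum(axis=0)   # shape (n,), row sums of the similarity matrix
    return (phi(q) @ kv) / normalizer[:, None]

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 4))   # 6 queries of dimension 4
k = rng.normal(size=(8, 4))   # 8 keys of dimension 4
v = rng.normal(size=(8, 3))   # 8 values of dimension 3
out_quadratic = quadratic_attention(q, k, v)
out_linear = linear_attention(q, k, v)
print(np.allclose(out_quadratic, out_linear))
```

The regrouping only works because the similarity factorizes as a product of feature maps; a softmax over raw dot products does not factorize this way, which is why linear variants need a different normalization.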

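Motivation and Objectives

Introduce a flow-network view of attention to remove the reliance on specific inductive biases.
Develop Flow-Attention, with source competition and sink allocation, under flow conservation.
Demonstrate linear-time attention that retains its performance across diverse domains.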

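Proposed Method

Recast attention as information flow from sources (values) to sinks (results), passed through learned flow capacities (attentions).
Apply flow conservation to induce competition among sources and allocation among sinks, without introducing a locality bias.
Define Flow-Attention with competition, aggregation, and allocation steps, using a non-negative nonlinear projection φ(·) to produce the flow capacities.
Normalize φ(K) by the outgoing flow O and φ(Q) by the incoming flow I to enforce flow conservation (Eq. 5).
Compute the conserved incoming and outgoing flows (Î and Ô) and derive Flow-Attention: competition (V̂ = Softmax(Ô) ⊙ V), aggregation (A = (φ(Q)/I)(φ(K)ᵀV̂)), and allocation (Sigmoid(Î) ⊙ A) (Eq. 8).
Replace the standard attention in the Transformer with Flow-Attention to obtain Flowformer, which runs in linear time.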
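The three steps listed above can be sketched in NumPy as follows. This is a simplified, single-head, non-causal illustration written from the formulas in this review, not the code from the thuml/Flowformer repository. The function name `flow_attention`, the choice of sigmoid for φ, and the `eps` stabilizer are assumptions, and any additional rescaling or clamping a real implementation might apply is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def flow_attention(q, k, v, eps=1e-6):
    """Sketch of Flow-Attention. q: (n, d) sinks, k: (m, d) sources, v: (m, d_v)."""
    phi_q, phi_k = sigmoid(q), sigmoid(k)          # assumed non-negative phi
    # Incoming flow I of each sink and outgoing flow O of each source.
    incoming = phi_q @ phi_k.sum(axis=0) + eps     # shape (n,)
    outgoing = phi_k @ phi_q.sum(axis=0) + eps     # shape (m,)
    # Conserved flows: normalize the opposite side to unit flow, then re-measure.
    conserved_in = phi_q @ (phi_k / outgoing[:, None]).sum(axis=0)    # I-hat, (n,)
    conserved_out = phi_k @ (phi_q / incoming[:, None]).sum(axis=0)   # O-hat, (m,)
    # Competition: sources compete through a softmax over their conserved outgoing flow.
    v_hat = softmax(conserved_out)[:, None] * v
    # Aggregation: linear-time product, phi(K)^T V-hat is only (d, d_v).
    aggregated = (phi_q / incoming[:, None]) @ (phi_k.T @ v_hat)
    # Allocation: each sink gates its result by its conserved incoming flow.
    return sigmoid(conserved_in)[:, None] * aggregated

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 4))   # 5 sinks
k = rng.normal(size=(7, 4))   # 7 sources
v = rng.normal(size=(7, 3))
out = flow_attention(q, k, v)
print(out.shape)  # (5, 3)
```

Every intermediate has size proportional to n, m, or d × d_v, so no n × m attention matrix is built, which is where the linear complexity comes from.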

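Experimental Results

Research Questions

RQ1: Without relying on fixed inductive biases, can attention be made non-trivial and bias-free while achieving linear complexity?
RQ2: Does Flow-Attention, built on flow conservation, deliver competitive performance on long-sequence, language, vision, time-series, and reinforcement learning tasks?
RQ3: How do the competition and allocation components affect attention quality and downstream tasks?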

Key Findings

Model	ListOps ↑	Text ↑	Retrieval ↑	Image ↑	Pathfinder ↑	Avg. ↑
Flowformer	38.70	64.29	62.24	43.20	73.95	56.48
Flowformer w/o Allocation	37.00	63.78	61.33	42.52	73.26	55.58
Flowformer w/o Competition	36.80	63.48	61.66	42.39	71.90	55.25
Transformer (Vaswani et al., 2017)	36.37	64.27	57.46	42.44	71.40	54.39
BigBird (Zaheer et al., 2020)	36.05	64.02	59.29	40.83	74.87	55.01
cosFormer (Zhen et al., 2022)	37.90	63.41	61.36	43.17	70.33	55.23

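Flowformer matches or outperforms strong baselines on long-sequence, language, vision, time-series, and offline RL benchmarks.
On Long-Range Arena (LRA), Flowformer reaches an average accuracy of 56.48, ahead of the vanilla Transformer and many efficient attention models.
Ablations show that competition and allocation each contribute to performance: removing them lowers the LRA average by about 1.23 and 0.90 points, respectively.
On language modeling (WikiText-103), Flowformer reaches a perplexity of 30.8, better than the baselines and both ablations (31.2 without competition, 32.2 without allocation).
On ImageNet-1K, Flowformer matches or exceeds linear attention baselines and approaches or surpasses some full-attention models in Top-1/Top-5 accuracy.
Flowformer combines linear complexity with competitive accuracy and favorable efficiency, and the advantage becomes more pronounced as sequence length grows.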
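The ablation gains quoted above follow directly from the LRA table. The short check below recomputes the averages and the two differences from the per-task scores; the dictionary layout and variable names are only illustrative, and the numbers are copied from the table.

```python
# Per-task LRA accuracies copied from the table:
# ListOps, Text, Retrieval, Image, Pathfinder.
lra_scores = {
    "Flowformer": [38.70, 64.29, 62.24, 43.20, 73.95],
    "Flowformer w/o Allocation": [37.00, 63.78, 61.33, 42.52, 73.26],
    "Flowformer w/o Competition": [36.80, 63.48, 61.66, 42.39, 71.90],
}

# Average accuracy per model, rounded to two decimals as in the table.
averages = {name: round(sum(s) / len(s), 2) for name, s in lra_scores.items()}

# Drop in average accuracy when each component is removed.
gain_competition = round(averages["Flowformer"] - averages["Flowformer w/o Competition"], 2)
gain_allocation = round(averages["Flowformer"] - averages["Flowformer w/o Allocation"], 2)

print(averages)
print(gain_competition, gain_allocation)
```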


This review was generated by AI and reviewed by a human editor.