QUICK REVIEW

[Paper Review] Flowformer: Linearizing Transformers with Conservation Flows

Haixu Wu, Jialong Wu|arXiv (Cornell University)|2022. 02. 13.

Neural Networks and Reservoir Computing|Citations: 31

One-Line Summary

Flowformer introduces Flow-Attention, built on flow conservation, to linearize Transformer attention. It achieves linear time complexity with competitive performance across long sequences, language, vision, time series, and reinforcement learning.

ABSTRACT

Transformers based on the attention mechanism have achieved impressive success in various areas. However, the attention mechanism has a quadratic complexity, significantly impeding Transformers from dealing with numerous tokens and scaling up to bigger models. Previous methods mainly utilize the similarity decomposition and the associativity of matrix multiplication to devise linear-time attention mechanisms. They avoid degeneration of attention to a trivial distribution by reintroducing inductive biases such as the locality, thereby at the expense of model generality and expressiveness. In this paper, we linearize Transformers free from specific inductive biases based on the flow network theory. We cast attention as the information flow aggregated from the sources (values) to the sinks (results) through the learned flow capacities (attentions). Within this framework, we apply the property of flow conservation into attention and propose the Flow-Attention mechanism of linear complexity. By respectively conserving the incoming flow of sinks for source competition and the outgoing flow of sources for sink allocation, Flow-Attention inherently generates informative attentions without using specific inductive biases. Empowered by the Flow-Attention, Flowformer yields strong performance in linear time for wide areas, including long sequence, time series, vision, natural language, and reinforcement learning. The code and settings are available at this repository: https://github.com/thuml/Flowformer.
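
The abstract's point about "similarity decomposition and the associativity of matrix multiplication" can be illustrated with a small NumPy sketch. This is not code from the paper: the sigmoid feature map `phi`, the function names, and the tensor sizes are all assumptions chosen for illustration. It only shows that reordering the product avoids building the n x m attention matrix while giving the same output.

```python
import numpy as np

def phi(x):
    # Non-negative feature map. Sigmoid is an assumption made for this sketch.
    return 1.0 / (1.0 + np.exp(-x))

def quadratic_attention(q, k, v):
    # Builds the full n x m similarity matrix, so cost grows with n * m.
    scores = phi(q) @ phi(k).T
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ v

def linear_attention(q, k, v):
    # Associativity: (phi(Q) phi(K)^T) V == phi(Q) (phi(K)^T V).
    # The d x d_v summary `kv` is computed once, so cost grows with n + m.
    kv = phi(k).T @ v                          # shape (d, d_v)
    normalizer = phi(q) @ phi(k).sum(axis=0)   # row sums of the similarity matrix, shape (n,)
    return (phi(q) @ kv) / normalizer[:, None]

rng = np.random.default_rng(0)
q = rng.standard_normal((6, 4))   # 6 queries of dimension 4 (hypothetical sizes)
k = rng.standard_normal((8, 4))   # 8 keys
v = rng.standard_normal((8, 3))   # 8 values of dimension 3

out_quadratic = quadratic_attention(q, k, v)
out_linear = linear_attention(q, k, v)
```

Both functions compute the same normalized kernel attention; only the order of the matrix products differs.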

Motivation and Goals

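Introduce a flow-network view of attention to remove the reliance on specific inductive biases.
Develop Flow-Attention as source competition and sink allocation under flow conservation.
Demonstrate linear-time attention that preserves performance across diverse domains.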

Proposed Method

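Recasts attention as information flow from sources (values) to sinks (results) through learned flow capacities (attentions).
Applies flow conservation to induce competition among sources and allocation among sinks, without a locality bias.
Defines Flow-Attention through competition, aggregation, and allocation steps, using a non-negative nonlinear projection φ(·) for the flow capacities.
To enforce flow conservation, normalizes φ(K) by the outgoing flow O and φ(Q) by the incoming flow I (Eq. 5).
Computes the conserved incoming and outgoing flows (Î and Ô) and derives Flow-Attention (Eq. 8): Competition, V̂ = Softmax(Ô) ⊙ V; Aggregation, A = (φ(Q)/I)(φ(K)ᵀV̂); Allocation, R = Sigmoid(Î) ⊙ A.
Replaces the standard attention in the Transformer with Flow-Attention to obtain Flowformer, achieving linear time complexity.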
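
The competition, aggregation, and allocation steps listed above can be sketched in NumPy as follows. This is a simplified single-head, non-causal illustration written from the summary in this review, not the authors' released implementation: using sigmoid for φ, the `eps` stabilizer, the function names, and the tensor sizes are assumptions, and any extra scaling or clamping that a production implementation might apply is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Softmax over a 1-D vector, shifted by the max for numerical stability.
    e = np.exp(x - x.max())
    return e / e.sum()

def flow_attention(q, k, v, eps=1e-6):
    """Single-head, non-causal sketch. q: (n, d), k: (m, d), v: (m, d_v)."""
    # Non-negative flow capacities; sigmoid as phi is an assumption here.
    phi_q, phi_k = sigmoid(q), sigmoid(k)

    # Incoming flow I of each sink and outgoing flow O of each source.
    incoming = phi_q @ phi_k.sum(axis=0) + eps    # shape (n,)
    outgoing = phi_k @ phi_q.sum(axis=0) + eps    # shape (m,)

    # Conserved flows: each side is measured after the other side
    # has been normalized by its own flow.
    conserved_incoming = phi_q @ (phi_k / outgoing[:, None]).sum(axis=0)  # (n,)
    conserved_outgoing = phi_k @ (phi_q / incoming[:, None]).sum(axis=0)  # (m,)

    # Competition: sources compete via a softmax over conserved outgoing flow.
    v_hat = softmax(conserved_outgoing)[:, None] * v                      # (m, d_v)
    # Aggregation: phi(K)^T V_hat is formed first, so no n x m matrix is built.
    aggregated = (phi_q / incoming[:, None]) @ (phi_k.T @ v_hat)          # (n, d_v)
    # Allocation: each sink gates its aggregated result by its conserved incoming flow.
    return sigmoid(conserved_incoming)[:, None] * aggregated

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 4))   # 5 sinks (hypothetical sizes)
k = rng.standard_normal((7, 4))   # 7 sources
v = rng.standard_normal((7, 3))
out = flow_attention(q, k, v)     # shape (5, 3)
```

Every intermediate quantity is a vector of length n or m, or a small d x d_v matrix, which is where the linear scaling in sequence length comes from.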

Experimental Results

Research Questions

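RQ1: Can attention reach linear complexity while avoiding degeneration to a trivial distribution, without relying on specific inductive biases?
RQ2: Does Flow-Attention based on flow conservation deliver competitive performance on long sequences, language, vision, time series, and reinforcement learning?
RQ3: How do the competition and allocation components affect attention quality and downstream tasks?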

Key Results

Model	ListOps ↑	Text ↑	Retrieval ↑	Image ↑	Pathfinder ↑	Avg. ↑
Flowformer	38.70	64.29	62.24	43.20	73.95	56.48
Flowformer w/o Allocation	37.00	63.78	61.33	42.52	73.26	55.58
Flowformer w/o Competition	36.80	63.48	61.66	42.39	71.90	55.25
Transformer (Vaswani et al., 2017)	36.37	64.27	57.46	42.44	71.40	54.39
BigBird (Zaheer et al., 2020)	36.05	64.02	59.29	40.83	74.87	55.01
cosFormer (Zhen et al., 2022)	37.90	63.41	61.36	43.17	70.33	55.23

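Flowformer achieves competitive or superior results against strong baselines on long-sequence, language, vision, time-series, and online/offline RL benchmarks.
On Long-Range Arena (LRA), Flowformer reaches an average accuracy of 56.48, surpassing the vanilla Transformer and many efficient-attention models.
Ablations show that competition and allocation each contribute to performance: removing them lowers the LRA average by about 1.23 and 0.90 points, respectively.
In language modeling (WikiText-103), Flowformer reaches a perplexity of 30.8, better than the baselines and the ablated variants (Flowformer w/o Competition 31.2, w/o Allocation 32.2).
On ImageNet-1K, Flowformer matches or surpasses linear-attention baselines, and approaches or exceeds some full-attention models in Top-1/Top-5 accuracy.
Flowformer keeps linear complexity with competitive accuracy, and its efficiency advantage grows with sequence length.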
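
The ablation gains quoted above follow directly from the LRA table. A short check, using only the per-task scores listed in that table (the variable names are illustrative):

```python
# Per-task LRA accuracies copied from the table:
# ListOps, Text, Retrieval, Image, Pathfinder.
lra_scores = {
    "Flowformer": [38.70, 64.29, 62.24, 43.20, 73.95],
    "Flowformer w/o Allocation": [37.00, 63.78, 61.33, 42.52, 73.26],
    "Flowformer w/o Competition": [36.80, 63.48, 61.66, 42.39, 71.90],
}

# Average accuracy per model, rounded to two decimals as in the table.
averages = {name: round(sum(s) / len(s), 2) for name, s in lra_scores.items()}

# Gain attributed to each component = full model minus the ablated variant.
gain_competition = round(averages["Flowformer"] - averages["Flowformer w/o Competition"], 2)
gain_allocation = round(averages["Flowformer"] - averages["Flowformer w/o Allocation"], 2)

print(averages["Flowformer"])   # 56.48
print(gain_competition)         # 1.23
print(gain_allocation)          # 0.9
```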


This review was generated by AI and reviewed by a human editor.