QUICK REVIEW

[Paper Review] Rethinking Attention with Performers

Krzysztof Choromański, Valerii Likhosherstov | arXiv (Cornell University) | September 30, 2020

Domain Adaptation and Few-Shot Learning | References: 55 | Citations: 122

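One-Line Summary

Performer introduces FAVOR+ to approximate softmax attention with linear space and time complexity, which enables large-scale Transformer-like models without sparsity priors while providing rigorous accuracy guarantees and compatibility with standard Transformers.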

ABSTRACT

We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.

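Motivation and Objectives

Motivates the need for scalable attention mechanisms that do not rely on sparsity or low-rank priors.
Introduces Performers as a Transformer variant that approximates full-rank softmax attention with linear complexity.
Develops and formalizes the FAVOR+ mechanism for unbiased kernel-based attention estimation.
Provides theoretical guarantees for the attention approximation (unbiasedness, uniform convergence, low variance).
Demonstrates empirical effectiveness on vision, language, and biology-style sequence modeling tasks.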

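Proposed Method

Defines attention in kernelized form and approximates it using positive random features (PRFs) and orthogonal random features (ORFs).
Specifies FAVOR+, which uses positive random features to approximate the softmax kernel and enables attention computation in linear space and time.
Proves unbiased or nearly unbiased estimation of the attention matrix, together with uniform convergence and reduced variance.
Shows that a regularized softmax kernel approximates softmax well, which makes practical training possible.
Provides pseudocode for integration with standard Transformers and discusses implementation details.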
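
The kernelized attention idea listed above can be illustrated with a small sketch. It assumes the positive random feature map phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m) with Gaussian w_i, so that phi(q) . phi(k) estimates exp(q . k), and it reorders the matrix products so the L x L attention matrix is never formed. The function names (`positive_features`, `linear_attention`, `softmax_attention`) are illustrative, not taken from the paper's code. The sketch uses plain i.i.d. Gaussian projections without orthogonalization and is written in dependency-free Python for clarity rather than speed.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def positive_features(x, omegas):
    """Positive random feature map: phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m)."""
    half_sq_norm = dot(x, x) / 2.0
    inv_sqrt_m = 1.0 / math.sqrt(len(omegas))
    return [math.exp(dot(w, x) - half_sq_norm) * inv_sqrt_m for w in omegas]


def linear_attention(Q, K, V, omegas):
    """Approximate softmax attention without forming the L x L attention matrix."""
    d, d_v, m = len(Q[0]), len(V[0]), len(omegas)
    scale = d ** -0.25  # scaling q and k each by d^(-1/4) yields q.k / sqrt(d)
    q_feats = [positive_features([scale * t for t in q], omegas) for q in Q]
    k_feats = [positive_features([scale * t for t in k], omegas) for k in K]
    # Key-side summaries, computed once and shared by every query.
    kv = [[sum(kf[i] * v[c] for kf, v in zip(k_feats, V)) for c in range(d_v)]
          for i in range(m)]
    k_sum = [sum(kf[i] for kf in k_feats) for i in range(m)]
    out = []
    for qf in q_feats:
        denom = dot(qf, k_sum)  # row sum of the implicit attention matrix
        out.append([sum(qf[i] * kv[i][c] for i in range(m)) / denom
                    for c in range(d_v)])
    return out


def softmax_attention(Q, K, V):
    """Exact quadratic-cost reference, used only for comparison."""
    d = len(Q[0])
    out = []
    for q in Q:
        weights = [math.exp(dot(q, k) / math.sqrt(d)) for k in K]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / total
                    for c in range(len(V[0]))])
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    L, d, m = 6, 4, 2000  # illustrative sizes, not values from the paper
    Q = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(L)]
    K = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(L)]
    V = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(L)]
    omegas = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    approx = linear_attention(Q, K, V, omegas)
    exact = softmax_attention(Q, K, V)
    print(max(abs(a - e) for ra, re in zip(approx, exact) for a, e in zip(ra, re)))
```

Because every feature is strictly positive, each output row is a convex combination of the rows of V, which mirrors the behavior of exact softmax attention. The cost of the key-side summaries grows linearly with the sequence length L for a fixed number of features m.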

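Experimental Results

Research Questions

RQ1: Can softmax attention be approximated accurately with linear space and time complexity, without priors such as sparsity or low-rankness?
RQ2: How effective is FAVOR+ at approximating softmax attention across a range of tasks?
RQ3: Do the theoretical guarantees (unbiasedness, uniform convergence, low variance) hold for the Performer approximation?
RQ4: How does FAVOR+ perform empirically against other efficient attention methods on long-sequence and protein modeling tasks?
RQ5: Can FAVOR+ be applied to other kernelizable attention mechanisms beyond softmax?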

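Key Results

Performers achieve results competitive with other efficient attention methods while retaining linear complexity.
FAVOR+ provides unbiased or nearly unbiased estimation of regular softmax attention, with uniform convergence and lower estimation variance.
Orthogonal and positive random features reduce mean squared error and enable accurate attention approximation with a practical number of features.
Empirical results show favorable speed/memory trade-offs and compatibility with pretrained Transformer weights via fine-tuning.
The method scales to long sequences (large sequence length L) and protein-style sequence modeling, approaching or surpassing Transformer performance under linear resource usage.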
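
The orthogonal random features mentioned in the results above can be built, for example, by orthogonalizing Gaussian samples in blocks of d rows and then restoring Gaussian-like row norms. The sketch below assumes Gram-Schmidt orthogonalization and redraws each row length from the norm of a fresh Gaussian vector. The names `orthogonal_block` and `orthogonal_random_features` are hypothetical helpers for illustration, not part of any library.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthogonal_block(d, rng):
    """Return d mutually orthogonal rows in R^d with Gaussian-like norms."""
    unit_rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Gram-Schmidt step: remove components along the rows found so far.
        for u in unit_rows:
            proj = dot(v, u)
            v = [a - proj * b for a, b in zip(v, u)]
        norm = math.sqrt(dot(v, v))
        unit_rows.append([a / norm for a in v])
    block = []
    for u in unit_rows:
        # Rescale each direction by the norm of an independent Gaussian vector,
        # so every row is marginally distributed like a standard Gaussian sample.
        length = math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))
        block.append([length * a for a in u])
    return block


def orthogonal_random_features(m, d, rng):
    """Stack independent orthogonal blocks until m projection rows are available."""
    rows = []
    while len(rows) < m:
        rows.extend(orthogonal_block(d, rng))
    return rows[:m]


if __name__ == "__main__":
    rng = random.Random(0)
    W = orthogonal_random_features(6, 4, rng)  # illustrative sizes
    # Rows 0..3 come from the same block, so their dot products are near zero.
    print(abs(dot(W[0], W[1])))
```

Rows are exactly orthogonal only within a block of d rows; rows from different blocks are independent. A matrix built this way could be passed as the projection set of a positive random feature map in place of i.i.d. Gaussian rows.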

This review was generated by AI and reviewed by a human editor.