QUICK REVIEW

[Paper Review] SOFT: Softmax-free Transformer with Linear Complexity

Jiachen Lu, Jinghan Yao | arXiv (Cornell University) | 2021-10-22

Advanced Neural Network Applications | References: 45 | Citations: 61

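One-line summary

SOFT introduces a softmax-free self-attention mechanism built on a Gaussian kernel and a Nyström-based low-rank approximation, with the Moore-Penrose inverse computed by Newton-Raphson iteration. It achieves linear time and space complexity and higher ImageNet accuracy than other linear transformers.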

ABSTRACT

Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention. However, the employment of self-attention modules results in a quadratic complexity in both computation and memory usage. Various attempts on approximating the self-attention computation with linear complexity have been made in Natural Language Processing. However, an in-depth analysis in this work shows that they are either theoretically flawed or empirically ineffective for visual recognition. We further identify that their limitations are rooted in keeping the softmax self-attention during approximations. Specifically, conventional self-attention is computed by normalizing the scaled dot-product between token feature vectors. Keeping this softmax operation challenges any subsequent linearization efforts. Based on this insight, for the first time, a softmax-free transformer or SOFT is proposed. To remove softmax in self-attention, Gaussian kernel function is used to replace the dot-product similarity without further normalization. This enables a full self-attention matrix to be approximated via a low-rank matrix decomposition. The robustness of the approximation is achieved by calculating its Moore-Penrose inverse using a Newton-Raphson method. Extensive experiments on ImageNet show that our SOFT significantly improves the computational efficiency of existing ViT variants. Crucially, with a linear complexity, much longer token sequences are permitted in SOFT, resulting in superior trade-off between accuracy and complexity.

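Motivation and goals

Motivate the need for efficient Transformers that can handle long token sequences in vision.
Propose a softmax-free self-attention mechanism that makes linear complexity possible.
Develop a Nyström-based low-rank approximation for robust attention, using a Moore-Penrose inverse computed by Newton-Raphson iteration.
Design SOFT-based backbones and evaluate their accuracy-complexity trade-off on ImageNet.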

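Proposed method

Replace softmax attention with Gaussian kernel attention S = exp(Q ⊖ K), which is symmetric and has entries in [0,1].
Approximate the full attention matrix in linear time and space with a Nyström decomposition using a small bottleneck of size m, giving Ŝ = Pᵀ A† P.
Compute the Moore-Penrose inverse A† by Newton-Raphson iteration for numerical robustness (A₀ = αA, A_{k+1} = 2A_k − A_k A A_k).
Sample the bottleneck tokens by convolution, average pooling, or another sampling method; average pooling is preferred for stability and efficiency.
Instantiate SOFT as a layer of a pyramidal vision transformer backbone with specific hyperparameters (d_e, h, n, m, sp, etc.) to build the SOFT variants.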
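
The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the input X serves as query, key, and value with no learned projections, there is a single head, the kernel exponent is scaled by 1/(2·sqrt(d)), bottleneck tokens come from 1-D average pooling over contiguous groups (so n must be divisible by m), and α is set to 1 / (‖A‖₁ ‖A‖∞), one common choice that keeps the iteration convergent.

```python
import numpy as np


def gaussian_kernel(X, Y):
    """Pairwise Gaussian kernel exp(-||x_i - y_j||^2 / (2 * sqrt(d))).

    The 1 / (2 * sqrt(d)) scaling is an assumption of this sketch.
    """
    d = X.shape[1]
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    sq = np.maximum(sq, 0.0)  # clip tiny negatives caused by rounding
    return np.exp(-sq / (2.0 * np.sqrt(d)))


def newton_raphson_pinv(A, iters=20):
    """Approximate the Moore-Penrose inverse of a symmetric matrix A.

    Uses A_0 = alpha * A and A_{k+1} = 2 A_k - A_k A A_k.
    alpha = 1 / (||A||_1 * ||A||_inf) is an assumed, conservative choice.
    """
    alpha = 1.0 / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    Ak = alpha * A
    for _ in range(iters):
        Ak = 2.0 * Ak - Ak @ A @ Ak
    return Ak


def soft_attention(X, m, iters=20):
    """Apply the low-rank attention S_hat = P^T A^+ P to X without forming S_hat."""
    n, d = X.shape
    if n % m != 0:
        raise ValueError("this sketch assumes n is divisible by m")
    # Average pooling over contiguous groups yields m bottleneck tokens.
    bottleneck = X.reshape(m, n // m, d).mean(axis=1)
    P = gaussian_kernel(bottleneck, X)           # shape (m, n)
    A = gaussian_kernel(bottleneck, bottleneck)  # shape (m, m)
    A_pinv = newton_raphson_pinv(A, iters)
    # Multiply right to left so every intermediate is at most (n, m) or (m, d).
    return P.T @ (A_pinv @ (P @ X))


rng = np.random.default_rng(0)
X = rng.standard_normal((64, 16))
out = soft_attention(X, m=8)
print(out.shape)  # (64, 16)
```

Evaluating the product right to left is what keeps the cost linear in n: the n × n matrix Ŝ is never materialized, only the m × n factor P and the m × m block A.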

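Experimental results

Research questions

RQ1: Can softmax-free Gaussian kernel attention reach accuracy comparable to softmax attention in a Vision Transformer?
RQ2: Do the Nyström-based low-rank approximation and the Moore-Penrose inverse provide stable training and linear complexity on vision tasks?
RQ3: Which design choices, such as the bottleneck size m and the sampling strategy, optimize SOFT's accuracy-efficiency trade-off?
RQ4: How does SOFT perform against other linear or efficient transformers on ImageNet and NLP benchmarks?
RQ5: What architectural benefits arise from integrating SOFT into a pyramidal transformer backbone for visual recognition?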

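Key results

SOFT achieves linear time and space complexity (O(n)) for attention, which allows longer token sequences.
The Moore-Penrose inverse computed by Newton-Raphson iteration, combined with the Nyström-based decomposition, gives a robust attention approximation.
On ImageNet, SOFT-based backbones outperform many CNN and ViT variants in accuracy-complexity trade-off.
Under the same settings, SOFT shows competitive or better accuracy than Linformer, Performer, and Nyströmformer.
A bottleneck size of m ≈ 49 appears to balance accuracy and compute well, and average pooling is the best of the sampling methods tested.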
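
To make the linear-complexity claim concrete, the sketch below counts how many similarity entries must be stored for full attention (n × n) versus a rank-m factorization (the m × n factor P plus the m × m block A). The function name and the token counts 196, 784, and 3136 are illustrative assumptions, not figures from the paper, and the count ignores heads, constants, and the cost of the pseudo-inverse iteration.

```python
def attention_entries(n, m=None):
    """Stored similarity entries: n*n for full attention, n*m + m*m with a bottleneck of size m."""
    if m is None:
        return n * n
    return n * m + m * m


# Illustrative token counts (assumed), with the bottleneck size m = 49 from the results above.
for n in (196, 784, 3136):
    full = attention_entries(n)
    low_rank = attention_entries(n, m=49)
    print(f"n={n}: full={full}, low-rank={low_rank}, ratio={full / low_rank:.1f}")
```

With m fixed, the low-rank count grows in proportion to n while the full count grows with n², so the gap widens as sequences get longer.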


This review was generated by AI and reviewed by a human editor.