QUICK REVIEW

[Paper Review] MiTA Attention: Efficient Fast-Weight Scaling via a Mixture of Top-k Activations

Qishuai Wen, Zhiyuan Huang|arXiv (Cornell University)|2026. 02. 01.

Advanced Neural Network Applications | Citations: 0

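One-Line Summary

MiTA attention combines compression and routing to build deformable fast-weight experts, using landmark queries and top-k activations to make attention efficient on long sequences. It unifies existing efficient attention methods under a five-dimensional taxonomy and shows competitive performance on vision tasks.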

ABSTRACT

The attention operator in Transformers can be viewed as a two-layer fast-weight MLP, whose weights are dynamically instantiated from input tokens and whose width equals sequence length N. As the context extends, the expressive capacity of such an N-width MLP increases, but scaling its fast weights becomes prohibitively expensive for extremely long sequences. Recently, this fast-weight scaling perspective has motivated the Mixture-of-Experts (MoE) attention, which partitions the sequence into fast-weight experts and sparsely routes the tokens to them. In this paper, we elevate this perspective to a unifying framework for a wide range of efficient attention methods by interpreting them as scaling fast weights through either routing or compression. Then we propose a compress-and-route strategy, which compresses the N-width MLP into a narrower one using a small set of landmark queries and constructs deformable experts by gathering top-k activated key-value pairs for each landmark query. We call this strategy a Mixture of Top-k Activations (MiTA), and refer to the resulting efficient mechanism as MiTA attention. Preliminary experiments on vision tasks demonstrate the promise of our MiTA attention and motivate further investigation on its optimization and broader applications in more challenging settings.
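The fast-weight view in the abstract can be written out for a single query. The identity below is just the standard softmax-attention formula regrouped; the symbols W_1, W_2, and σ are notation introduced here for illustration and may differ from the paper's.

```latex
% One query q in R^d, keys K in R^{N x d}, values V in R^{N x d_v}.
\[
\operatorname{Attn}(q)
  = V^{\top}\,\operatorname{softmax}\!\left(\frac{K q}{\sqrt{d}}\right)
  = W_2\,\sigma(W_1 q),
\qquad
W_1 = \frac{K}{\sqrt{d}} \in \mathbb{R}^{N \times d},\quad
W_2 = V^{\top} \in \mathbb{R}^{d_v \times N},\quad
\sigma = \operatorname{softmax}.
\]
```

The hidden layer has one unit per key-value pair, so its width is N, and both weight matrices are built from the input tokens rather than learned directly, which is what makes them "fast" weights.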

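Research Motivation and Goals

Raises the problem of scaling Transformer attention to extremely long sequences.
Introduces a taxonomy that unifies efficient attention methods along five dimensions from the fast-weight perspective.
Proposes MiTA, a compress-and-route strategy that builds deformable fast-weight experts.
Demonstrates the effectiveness of MiTA on vision tasks and long-sequence benchmarks and discusses the trade-off in computational cost.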

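Proposed Method

Reinterprets full attention as a two-layer fast-weight MLP of width N.
Proposes a five-dimensional taxonomy of efficient attention methods (scaling strategy, number of experts, expert type, expert construction, routing topology).
Introduces MiTA: it compresses the global fast-weight module through landmark queries and builds deformable experts by gathering the top-k activated key-value pairs for each landmark.
Uses the landmark queries to form a shared global expert and routes sparsely so that the results are concatenated and handled in a single attention operation.
Provides the MiTA algorithm, which uses m landmark queries and top-k selection of size k to form the K* and V* used for attention.
Discusses implementation notes and complexity, contrasting the O(N(m+ks)) cost per attention operation with the quadratic complexity of full attention.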
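The steps listed above can be sketched roughly as follows. This is a minimal plain-Python illustration, not the authors' implementation. Several details are assumptions, because this review does not specify them: landmark queries are mean-pooled from contiguous chunks of the queries; the compressed keys are the landmark queries themselves, and the compressed values are each landmark's attention output over all key-value pairs; each query is routed to the s landmarks it scores highest against; and s is read as the number of experts per query. The name `mita_attention` and all helper names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled softmax attention of one query over lists of key/value vectors."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, key) * scale for key in keys])
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]


def top_indices(scores, count):
    """Indices of the `count` largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:count]


def mita_attention(Q, K, V, m, k, s=1):
    n, d = len(Q), len(Q[0])
    assert 1 <= m <= n and 1 <= k <= len(K) and 1 <= s <= m

    # ASSUMPTION: landmark queries = mean of each of m contiguous query chunks.
    bounds = [i * n // m for i in range(m + 1)]
    landmarks = []
    for i in range(m):
        chunk = Q[bounds[i]:bounds[i + 1]]
        landmarks.append([sum(row[j] for row in chunk) / len(chunk)
                          for j in range(d)])

    # Compression (ASSUMPTION on the exact form): each landmark cross-attends
    # to all key-value pairs, giving m compressed values; landmarks act as keys.
    compressed_values = [attend(lm, K, V) for lm in landmarks]

    # Deformable experts: the top-k activated key-value indices per landmark.
    experts = [top_indices([dot(lm, key) for key in K], k) for lm in landmarks]

    outputs = []
    for q in Q:
        # ASSUMPTION: route each query to its s best-matching landmarks.
        routed = top_indices([dot(q, lm) for lm in landmarks], s)
        # Duplicates across experts are merged here (a choice of this sketch).
        picked = sorted({idx for e in routed for idx in experts[e]})
        # K* and V*: compressed pairs concatenated with the routed subset.
        k_star = landmarks + [K[i] for i in picked]
        v_star = compressed_values + [V[i] for i in picked]
        outputs.append(attend(q, k_star, v_star))  # at most m + k*s pairs
    return outputs


random.seed(0)
n, d = 12, 4
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
out = mita_attention(Q, K, V, m=3, k=2, s=1)
print(len(out), len(out[0]))  # prints "12 4"
```

Each query scores against at most m + k·s key-value pairs instead of all N, which is where the N(m+ks) term comes from under this reading of s. A practical implementation would batch these steps as tensor operations rather than Python loops.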

Figure 1 : Fast-weight scaling and its two scaling strategies. As the context extends, the width of the two-layer fast-weight MLP induced by full attention increases accordingly. We categorize efficient fast-weight scaling approaches into two strategies: a) scaling by routing and b) scaling by compr

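Experimental Results

Research Questions

RQ1: How can fast-weight attention be scaled effectively to extremely long sequences without sacrificing too much expressive capacity?
RQ2: Can combining compression and routing give attention both global context and precise token-level retrieval?
RQ3: What is a practical, hardware-friendly way to make a fixed number of deformable fast-weight experts adapt to the input content?
RQ4: Do MiTA's deformable experts and shared global module generalize across vision tasks and long-sequence benchmarks?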

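Key Results

MiTA combines compression and routing to reach near-linear scaling, with a per-attention complexity of O(N(m+ks)), in place of full attention's O(N^2).
MiTA uses m landmark queries to build deformable experts from top-k activations, and forms a shared global expert over the landmark values through cross-attention.
On ImageNet-1K, MiTA-ViT variants come close to or roughly match ViT accuracy and outperform Agent-ViT under similar settings.
In semantic segmentation, decoders using MiTA attention achieve mIoU competitive with full-attention baselines.
On Long Range Arena, MiTA maintains high accuracy across tasks and tends to deliver better wall-clock throughput than full attention at long sequence lengths.
MiTA generalizes robustly across different settings of the number of experts m and the expert width k, and tends to generalize better as these parameters increase.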
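To make the O(N(m+ks)) versus O(N^2) comparison concrete, the short sketch below counts query-key score evaluations in the final attention step only. The sizes are hypothetical values chosen for illustration, not settings from the paper, and s is assumed to be the number of experts each query is routed to, which this review does not define.

```python
def full_attention_pairs(n):
    """Every one of n queries scores against all n keys."""
    return n * n


def mita_attention_pairs(n, m, k, s):
    """Each query scores against m compressed pairs plus k pairs from each
    of s routed experts (ASSUMPTION: s = experts per query)."""
    return n * (m + k * s)


# Hypothetical sizes, for illustration only.
n, m, k, s = 16384, 64, 64, 2
full = full_attention_pairs(n)
mita = mita_attention_pairs(n, m, k, s)
print(full)         # prints 268435456
print(mita)         # prints 3145728
print(full / mita)  # about 85.3
```

The ratio is simply N / (m + ks), so the saving grows linearly with sequence length when m, k, and s are held fixed. This count leaves out the cost of building landmarks and selecting the top-k pairs.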

Figure 2 : Illustration for our MiTA attention. In full attention, each query attends to all key-value pairs. In our MiTA attention, it attends to the concatenation of a small number of the compressed key-value pairs and a routed subset of the full key-value pairs.


This review was generated by AI and reviewed by a human editor.