QUICK REVIEW

[Paper Review] MiTA Attention: Efficient Fast-Weight Scaling via a Mixture of Top-k Activations

Qishuai Wen, Zhiyuan Huang|arXiv (Cornell University)|Feb 1, 2026

Advanced Neural Network Applications|Citations: 0

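One-line summary

MiTA attention unifies compression and routing to build deformable fast-weight experts, using landmark queries and top-k activations to make attention efficient on long sequences. The paper also organizes prior efficient-attention methods under a five-dimensional taxonomy and reports competitive performance on vision tasks.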

ABSTRACT

The attention operator in Transformers can be viewed as a two-layer fast-weight MLP, whose weights are dynamically instantiated from input tokens and whose width equals sequence length N. As the context extends, the expressive capacity of such an N-width MLP increases, but scaling its fast weights becomes prohibitively expensive for extremely long sequences. Recently, this fast-weight scaling perspective has motivated the Mixture-of-Experts (MoE) attention, which partitions the sequence into fast-weight experts and sparsely routes the tokens to them. In this paper, we elevate this perspective to a unifying framework for a wide range of efficient attention methods by interpreting them as scaling fast weights through either routing or compression. Then we propose a compress-and-route strategy, which compresses the N-width MLP into a narrower one using a small set of landmark queries and constructs deformable experts by gathering top-k activated key-value pairs for each landmark query. We call this strategy a Mixture of Top-k Activations (MiTA), and refer to the resulting efficient mechanism as MiTA attention. Preliminary experiments on vision tasks demonstrate the promise of our MiTA attention and motivate further investigation on its optimization and broader applications in more challenging settings.
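The fast-weight reading of attention described in the abstract can be checked numerically. The following is a minimal NumPy sketch, not code from the paper: the function names `full_attention` and `fast_weight_mlp` are illustrative, and single-head attention without masking is assumed. The keys act as first-layer weights, softmax as the activation, and the values as second-layer weights, so the hidden layer has width N.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    # Standard single-head scaled dot-product attention.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def fast_weight_mlp(Q, K, V):
    # Same computation, written as a two-layer MLP whose weights
    # are instantiated from the input tokens (the "fast weights").
    d = Q.shape[-1]
    W1 = K.T / np.sqrt(d)      # first layer, shape (d, N)
    W2 = V                     # second layer, shape (N, d)
    hidden = softmax(Q @ W1)   # hidden activations, width N
    return hidden @ W2

rng = np.random.default_rng(0)
N, d = 8, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
same = np.allclose(full_attention(Q, K, V), fast_weight_mlp(Q, K, V))
print(same)
```

The two functions should agree up to floating-point rounding. The point of the rewrite is only that the hidden width grows with the sequence length N, which is what makes scaling expensive.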

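Motivation and objectives

Motivates the problem of scaling Transformer attention to long sequences.
Introduces a unified five-dimensional taxonomy of efficient attention methods from the fast-weight perspective.
Proposes MiTA, which builds deformable fast-weight experts through a compress-and-route strategy.
Demonstrates MiTA's effectiveness on vision tasks and long-sequence benchmarks, and discusses the computational trade-offs.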

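Proposed method

Recasts full attention as a two-layer fast-weight MLP whose width equals the sequence length N.
Proposes a five-dimensional taxonomy of efficient attention methods (scaling strategy, number of experts, expert type, expert composition, routing topology).
Introduces MiTA: landmark queries compress the global fast-weight module, and for each landmark the top-k activated key-value pairs are gathered to build a deformable expert.
Landmark queries form a shared global expert; queries are sparsely routed via cross-attention to the landmark values, and the results are concatenated into a single attention operation.
The MiTA algorithm uses m landmark queries and top-k selection of size k to form the K* and V* used for attention.
Discusses implementation notes and complexity, emphasizing a cost of O(N(m+ks)) versus the quadratic cost of full attention.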
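The compress-and-route steps listed above can be sketched in NumPy as follows. This is a rough illustration under stated assumptions, not the authors' algorithm: the function name `mita_attention_sketch` is hypothetical, landmark queries are assumed to be mean-pooled query chunks, the landmarks themselves are assumed to serve as compressed keys, and each query is assumed to be routed to a single expert. The review confirms none of these choices.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mita_attention_sketch(Q, K, V, m, k):
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    # Assumption: m landmark queries = means of contiguous query chunks.
    landmarks = np.stack([c.mean(axis=0) for c in np.array_split(Q, m)])
    # Activations of every key under every landmark query, shape (m, N).
    act = landmarks @ K.T * scale
    # Compression: each landmark summarizes all values (shared global expert).
    V_land = softmax(act) @ V          # (m, d)
    K_land = landmarks                 # assumption: landmarks act as keys
    # Deformable experts: indices of the top-k activated keys per landmark.
    topk = np.argpartition(-act, k - 1, axis=1)[:, :k]   # (m, k)
    # Routing (assumption): each query goes to its best-matching landmark.
    route = np.argmax(Q @ K_land.T, axis=1)              # (N,)
    sel = topk[route]                                    # (N, k)
    # K* and V*: compressed pairs concatenated with the routed pairs.
    K_star = np.concatenate([np.broadcast_to(K_land, (N, m, d)), K[sel]], axis=1)
    V_star = np.concatenate([np.broadcast_to(V_land, (N, m, d)), V[sel]], axis=1)
    # One softmax attention over m + k entries per query.
    logits = np.einsum("nd,njd->nj", Q, K_star) * scale
    return np.einsum("nj,njd->nd", softmax(logits), V_star)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 16)) for _ in range(3))
out = mita_attention_sketch(Q, K, V, m=4, k=8)
print(out.shape)
```

In this sketch each query attends to m + k entries instead of N, and the set of keys in each expert depends on the input rather than on fixed positions, which is one reading of "deformable".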

Figure 1 : Fast-weight scaling and its two scaling strategies. As the context extends, the width of the two-layer fast-weight MLP induced by full attention increases accordingly. We categorize efficient fast-weight scaling approaches into two strategies: a) scaling by routing and b) scaling by compression…

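Experimental results

Research questions

RQ1: How can fast-weight attention be scaled efficiently to long sequences without sacrificing much expressive power?
RQ2: Can combining compression and routing give attention both global context and precise token-level retrieval?
RQ3: What is a practical, hardware-friendly way to implement a fixed number of deformable fast-weight experts that adapt to the input content?
RQ4: Do MiTA's deformable experts and shared global module generalize across vision tasks and long-sequence benchmarks?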

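Key findings

By combining compression and routing, MiTA achieves near-linear O(N(m+ks)) scaling instead of O(N^2).
MiTA uses m landmark queries to build a shared global expert via cross-attention to the landmark values, and forms deformable experts from the top-k activations.
On ImageNet-1K, the MiTA-ViT family matches or comes close to ViT and outperforms Agent-ViT under comparable settings.
In semantic segmentation, decoders using MiTA attention reach an mIoU competitive with the full-attention baseline.
On Long Range Arena, MiTA maintains high accuracy across tasks and has better wall-clock throughput than full attention at long sequence lengths.
Results are robust to variations in the number of experts m and the expert width k; increasing these parameters tends to improve generalization more than decreasing them.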
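To get a feel for the O(N(m+ks)) versus O(N^2) scaling mentioned in the findings above, the following back-of-the-envelope sketch counts query-key score computations. The review does not define s; here it is assumed to be the number of experts each query is routed to. The function name `score_counts` and the example sizes are illustrative and not taken from the paper.

```python
def score_counts(N, m, k, s=1):
    """Count query-key dot products, ignoring constants such as head dim.

    Assumption: s is the number of experts each query is routed to.
    """
    full = N * N                       # every query scores every key
    landmark_side = m * N              # m landmark queries score all N keys
    query_side = N * (m + k * s)       # each query scores m + k*s entries
    return full, landmark_side + query_side

full, mita = score_counts(N=16384, m=64, k=64, s=1)
print(full, mita, full / mita)
```

Under these assumed sizes the MiTA-style count grows linearly in N for fixed m, k, and s, while the full-attention count grows quadratically. Actual speedups depend on the gather and top-k overheads that this count leaves out.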

Figure 2 : Illustration for our MiTA attention. In full attention, each query attends to all key-value pairs. In our MiTA attention, it attends to the concatenation of a small number of the compressed key-value pairs and a routed subset of the full key-value pairs.


This review was generated by AI and checked by a human editor.