QUICK REVIEW

[Paper Review] MiTA Attention: Efficient Fast-Weight Scaling via a Mixture of Top-k Activations

Qishuai Wen, Zhiyuan Huang|arXiv (Cornell University)|Feb 1, 2026
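Advanced Neural Network Applications | Cited by 0
One-Sentence Summary

MiTA attention fuses compression and routing to create deformable fast-weight experts, enabling efficient attention over long sequences; its defining components are landmark queries and top-k activations. It unifies prior efficient attention methods under a five-dimensional taxonomy and shows competitive performance on vision tasks.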

ABSTRACT

The attention operator in Transformers can be viewed as a two-layer fast-weight MLP, whose weights are dynamically instantiated from input tokens and whose width equals sequence length N. As the context extends, the expressive capacity of such an N-width MLP increases, but scaling its fast weights becomes prohibitively expensive for extremely long sequences. Recently, this fast-weight scaling perspective has motivated the Mixture-of-Experts (MoE) attention, which partitions the sequence into fast-weight experts and sparsely routes the tokens to them. In this paper, we elevate this perspective to a unifying framework for a wide range of efficient attention methods by interpreting them as scaling fast weights through either routing or compression. Then we propose a compress-and-route strategy, which compresses the N-width MLP into a narrower one using a small set of landmark queries and constructs deformable experts by gathering top-k activated key-value pairs for each landmark query. We call this strategy a Mixture of Top-k Activations (MiTA), and refer to the resulting efficient mechanism as MiTA attention. Preliminary experiments on vision tasks demonstrate the promise of our MiTA attention and motivate further investigation on its optimization and broader applications in more challenging settings.
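
The fast-weight view in the abstract can be checked numerically. The following is a minimal sketch, not code from the paper: it assumes single-head softmax attention with the usual 1/sqrt(d) scaling, and the names `fast_weight_mlp`, `W1`, and `W2` are illustrative. It treats the keys as first-layer weights and the values as second-layer weights, so the hidden layer has width N.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    # Standard single-head softmax attention.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def fast_weight_mlp(x, W1, W2):
    # Two-layer MLP with a softmax nonlinearity in the hidden layer.
    hidden = softmax(x @ W1.T)   # (num_queries, N): one hidden unit per token
    return hidden @ W2           # (num_queries, d)

rng = np.random.default_rng(0)
N, d = 16, 8
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# The "fast weights" are instantiated from the input tokens themselves.
W1 = K / np.sqrt(d)   # first-layer weights, shape (N, d)
W2 = V                # second-layer weights, shape (N, d)

out_attn = full_attention(Q, K, V)
out_mlp = fast_weight_mlp(Q, W1, W2)
print(np.allclose(out_attn, out_mlp))
```

Because `W1` and `W2` each have N rows, doubling the context doubles the hidden width, which is the cost that routing and compression aim to reduce.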

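Motivation and Objectives

  • Motivate the problem of scaling attention in Transformers to extremely long sequences.
  • Propose a five-dimensional taxonomy that unifies efficient attention methods from the fast-weight perspective.
  • Propose MiTA, a strategy that creates deformable fast-weight experts through compression and routing.
  • Demonstrate MiTA's effectiveness on vision tasks and long-sequence benchmarks, and discuss the computational trade-offs.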

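Proposed Method

  • Reformulate full attention as a two-layer fast-weight MLP whose width equals the sequence length N.
  • Propose a five-dimensional taxonomy of efficient attention methods (scaling strategy, number of experts, expert type, expert construction, routing topology).
  • Introduce MiTA: compress the global fast-weight module with landmark queries, and build deformable experts by gathering the top-k activated key-value pairs for each landmark.
  • Use the landmark queries to form a shared global expert, sparsely route queries to the deformable experts, and concatenate the results into a single attention operation.
  • Provide an algorithm (MiTA) that uses m landmark queries and top-k selection to form the K* and V* used for attention.
  • Discuss implementation notes and complexity, contrasting the O(N(m+ks)) cost per attention call with the quadratic cost of full attention.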
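
The compress-and-route steps listed above can be sketched roughly as follows. This is an illustrative reading of the summary, not the authors' implementation, and several choices are assumptions made only for the sketch: landmark queries come from average-pooling contiguous groups of queries, the compressed keys are the landmark queries themselves, each query is routed to the `n_routed` landmarks it scores highest against, and the function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def top_indices(scores, n):
    # Indices of the n largest entries in each row (order not guaranteed).
    return np.argpartition(-scores, n - 1, axis=1)[:, :n]

def mita_attention_sketch(Q, K, V, m, k, n_routed=1):
    N, d = Q.shape
    assert N % m == 0, "this sketch pools queries in equal contiguous groups"
    scale = 1.0 / np.sqrt(d)

    # ASSUMPTION: landmark queries are means of m contiguous query groups.
    L = Q.reshape(m, N // m, d).mean(axis=1)           # (m, d)
    S = L @ K.T * scale                                # (m, N) landmark activations

    # Compression: each landmark summarizes all values by cross-attention.
    K_c = L                                            # ASSUMPTION: compressed keys
    V_c = softmax(S) @ V                               # (m, d) compressed values

    # Deformable experts: the top-k activated key-value pairs per landmark.
    expert_idx = top_indices(S, k)                     # (m, k)

    # ASSUMPTION: route each query to its n_routed best-matching landmarks.
    route = top_indices(Q @ L.T, n_routed)             # (N, n_routed)
    sel = expert_idx[route].reshape(N, n_routed * k)   # routed key indices per query

    # Concatenate compressed and routed pairs into K* and V* for each query.
    K_star = np.concatenate([np.broadcast_to(K_c, (N, m, d)), K[sel]], axis=1)
    V_star = np.concatenate([np.broadcast_to(V_c, (N, m, d)), V[sel]], axis=1)

    # One softmax attention over m + n_routed * k pairs per query.
    scores = np.einsum("nd,njd->nj", Q, K_star) * scale
    return np.einsum("nj,njd->nd", softmax(scores), V_star)

rng = np.random.default_rng(0)
N, d, m, k = 64, 16, 8, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = mita_attention_sketch(Q, K, V, m=m, k=k)
print(out.shape)
```

Each query scores only m + n_routed * k pairs rather than N. With `n_routed` greater than 1, two experts may share a key, and this sketch does not deduplicate it; the gather into per-query `K_star` also favors clarity over memory efficiency.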
Figure 1 : Fast-weight scaling and its two scaling strategies. As the context extends, the width of the two-layer fast-weight MLP induced by full attention increases accordingly. We categorize efficient fast-weight scaling approaches into two strategies: a) scaling by routing and b) scaling by compr

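Experimental Results

Research Questions

  • RQ1: How can fast-weight attention be scaled efficiently to extremely long sequences without a significant loss of expressive capacity?
  • RQ2: Can combining compression and routing give attention both global context and precise token-level retrieval?
  • RQ3: What is a practical, hardware-friendly way to implement a fixed number of deformable fast-weight experts that adapt to the input content?
  • RQ4: Do MiTA's deformable experts and shared global module generalize across vision tasks and long-sequence benchmarks?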

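Key Findings

  • By combining compression and routing, MiTA scales linearly in N: O(N(m+ks)) per attention call instead of O(N^2).
  • MiTA uses m landmark queries to build deformable experts from top-k activations, and forms a shared global expert through cross-attention over the landmark values.
  • On ImageNet-1K, MiTA-ViT variants match or approach ViT, and outperform Agent-ViT under comparable settings.
  • In semantic segmentation, decoders with MiTA attention match or even exceed the full-attention baseline in mIoU.
  • On Long Range Arena, MiTA maintains high accuracy across tasks and delivers better practical throughput than full attention at long sequence lengths.
  • MiTA is robust to changes in the number of experts m and the width k, and generalizes better when these are increased than when they are decreased.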
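
To make the complexity claim concrete, the small sketch below counts query-key score evaluations for full attention and for the O(N(m+ks)) expression. The values of N, m, k, and s are hypothetical and are not settings reported in the paper. The symbol s is used as it appears in the complexity expression, which this summary does not define.

```python
def full_attention_pairs(N: int) -> int:
    # Every query scores every key: N * N evaluations.
    return N * N

def mita_pairs(N: int, m: int, k: int, s: int) -> int:
    # Direct evaluation of the N(m + ks) expression from the summary.
    return N * (m + k * s)

# Hypothetical sizes chosen only for illustration.
N, m, k, s = 16384, 64, 64, 1
full = full_attention_pairs(N)
mita = mita_pairs(N, m, k, s)
print(full, mita, full // mita)   # 268435456 2097152 128
```

With m, k, and s held fixed, doubling N doubles the MiTA count but quadruples the full-attention count.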
Figure 2 : Illustration for our MiTA attention. In full attention, each query attends to all key-value pairs. In our MiTA attention, it attends to the concatenation of a small number of the compressed key-value pairs and a routed subset of the full key-value pairs.


This review was generated by AI and reviewed by human editors.