QUICK REVIEW

[Paper Review] ZeroS: Zero-Sum Linear Attention for Efficient Transformers

Jiecheng Lu, Xu Han|arXiv (Cornell University)|Feb 5, 2026

Stochastic Gradient Optimization Techniques | Cited by 0

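One-Sentence Summary

ZeroS introduces zero-sum linear attention: it removes the zero-order term from softmax, which permits negative weights and contrastive token interactions while keeping O(N) complexity, and it matches or exceeds softmax attention on benchmarks.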

ABSTRACT

Linear attention methods offer Transformers $O(N)$ complexity but typically underperform standard softmax attention. We identify two fundamental limitations affecting these approaches: the restriction to convex combinations that only permits additive information blending, and uniform accumulated weight bias that dilutes attention in long contexts. We propose Zero-Sum Linear Attention (ZeroS), which addresses these limitations by removing the constant zero-order term $1/t$ and reweighting the remaining zero-sum softmax residuals. This modification creates mathematically stable weights, enabling both positive and negative values and allowing a single attention layer to perform contrastive operations. While maintaining $O(N)$ complexity, ZeroS theoretically expands the set of representable functions compared to convex combinations. Empirically, it matches or exceeds standard softmax attention across various sequence modeling benchmarks.
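
To make the "zero-order term" concrete, here is a sketch of how softmax weights over $t$ positions split into a uniform part and zero-sum residuals. The symbols $s_i$ (logits), $a_i$ (softmax weights), $\bar{s}$ (mean logit), and $r_i$ (higher-order remainder) are notation assumed for this illustration, and the split is a standard Taylor expansion around $s = 0$, not necessarily the authors' exact formulation.

$$
a_i = \frac{e^{s_i}}{\sum_{j=1}^{t} e^{s_j}}
    = \underbrace{\frac{1}{t}}_{\text{zero-order}}
    + \underbrace{\frac{s_i - \bar{s}}{t}}_{\text{first-order}}
    + \underbrace{r_i}_{\text{higher-order}},
\qquad
\bar{s} = \frac{1}{t}\sum_{j=1}^{t} s_j .
$$

Because $\sum_i a_i = 1$, $\sum_i 1/t = 1$, and $\sum_i (s_i - \bar{s}) = 0$, the remainder also satisfies $\sum_i r_i = 0$. Dropping the $1/t$ term therefore leaves two pieces that each sum to zero, and rescaling either piece by a per-query scalar keeps the total at zero while allowing individual weights to be negative.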

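Research Motivation and Objectives

Identify the fundamental limitations of linear attention methods that stem from the restriction to convex combinations and from uniform accumulated weight bias.
Develop a linear-time attention mechanism that supports negative weights and contrastive updates.
Show that ZeroS matches or exceeds softmax attention across diverse sequence modeling tasks.
Provide theoretical guarantees on the stability and expressiveness of the zero-sum attention formulation.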

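Proposed Method

Propose ZeroS by removing the zero-order term (1/t) from softmax and reweighting the residuals to obtain zero-sum weights.
Introduce radial-angular decoupling: separate magnitude from direction, apply learned gates to the first-order and higher-order softmax residuals, then reintroduce a signed cos(theta) term.
Formulate the reweighted zero-sum softmax with two gates that control the first-order and higher-order components, computed in linear time via prefix sums.
Integrate rotary position embeddings (RoPE) to incorporate angular information while preserving the zero-sum property of the attention weights.
Maintain O(N d^2) time and O(d^2) memory through separable logits and prefix-sum computation, supporting efficient training and inference.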
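
The prefix-sum idea can be sketched in NumPy for the simplest zero-sum case. This is not the full ZeroS layer: it keeps only the first-order residual weight (s_i - mean(s)) / t with dot-product logits, and it omits the gates, the higher-order term, radial-angular decoupling, and RoPE. The function names and the choice of dot-product logits are assumptions made for illustration.

```python
import numpy as np

def first_order_zero_sum_quadratic(q, k, v):
    """Reference O(N^2) version: build the causal zero-sum weights explicitly."""
    n = q.shape[0]
    out = np.zeros_like(v)
    for t in range(n):
        s = k[: t + 1] @ q[t]            # logits s_i = q_t . k_i for i <= t
        w = (s - s.mean()) / (t + 1)     # first-order residual weights, sum to zero
        out[t] = w @ v[: t + 1]
    return out

def first_order_zero_sum_linear(q, k, v):
    """Same output from running prefix sums: O(N d^2) time, O(d^2) state."""
    n, d = q.shape
    dv = v.shape[1]
    kv_sum = np.zeros((d, dv))   # running sum of outer(k_i, v_i)
    k_sum = np.zeros(d)          # running sum of k_i
    v_sum = np.zeros(dv)         # running sum of v_i
    out = np.zeros_like(v)
    for t in range(n):
        kv_sum += np.outer(k[t], v[t])
        k_sum += k[t]
        v_sum += v[t]
        cnt = t + 1
        mean_logit = q[t] @ k_sum / cnt  # mean of q_t . k_i over the prefix
        out[t] = (q[t] @ kv_sum - mean_logit * v_sum) / cnt
    return out
```

Both functions return the same array for floating-point inputs of shape (N, d) for q and k and (N, d_v) for v. The linear version never materializes the N-by-N weight matrix, because the logits are separable in q and k and the mean-logit correction only needs the running sums of k and v.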

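Experimental Results

Research Questions

RQ1: Does removing the zero-order term from softmax yield numerically stable, expressive zero-sum weights that admit negative values?
RQ2: Can ZeroS deliver linear-time attention across diverse tasks while matching or exceeding standard softmax attention?
RQ3: How do radial-angular decoupling and gating affect the expressiveness and stability of linear attention?
RQ4: Can ZeroS be integrated effectively with RoPE to preserve angular interactions in attention?
RQ5: What practical gains does ZeroS achieve on benchmarks such as MAD, WikiText, image classification, and time-series forecasting?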

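Key Findings

ZeroS matches or exceeds standard softmax attention on multiple benchmarks while retaining linear time complexity.
Removing the zero-order term enables negative weights and contrastive token interactions without sacrificing stability.
Gated radial-angular decoupling increases expressiveness beyond pure convex combinations and supports higher-order token interactions.
Compared with other linear methods, ZeroS is competitive or better on MAD, WikiText-103, ImageNet-1k-style tasks, and time-series datasets.
Ablation studies indicate that reintroducing the zero-order term can hurt certain in-context tasks, while gating and normalization help stability and performance.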
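
A small numeric illustration of why zero-sum weights go beyond convex combinations. It uses only the plain residual (softmax minus 1/t) on toy scalar values; the learned reweighting and gates used by ZeroS are not modeled, and the helper names and numbers are assumptions chosen for the example.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())      # subtract the max for numerical stability
    return e / e.sum()

def zero_sum_residual(s):
    """Softmax weights minus the uniform 1/t term; entries sum to zero."""
    return softmax(s) - 1.0 / s.size

logits = np.array([2.0, 0.0, -1.0])
values = np.array([1.0, 2.0, 3.0])

convex = softmax(logits)              # non-negative, sums to one
residual = zero_sum_residual(logits)  # mixed signs, sums to zero

convex_out = convex @ values      # always stays within [min(values), max(values)]
residual_out = residual @ values  # contrasts the favored token against the rest
```

With softmax weights the output is confined to the range spanned by the values. With the zero-sum residual, the weight on the highest-logit token is positive and the others are negative, so the output is a difference-like quantity that can fall outside that range.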

This review was generated by AI and reviewed by human editors.