QUICK REVIEW

[Paper Review] SLAY: Geometry-Aware Spherical Linearized Attention with Yat-Kernel

Jose Miguel Luna, Taha Bouhsine|arXiv (Cornell University)|Feb 4, 2026

Stochastic Gradient Optimization Techniques|Cited by 0

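One-Sentence Summary

SLAY is a geometry-aware, linear-time attention mechanism that linearizes the Yat-kernel over unit-sphere queries and keys, achieving O(L) time and memory at near-softmax quality while outperforming prior linear attention methods.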

ABSTRACT

We propose a new class of linear-time attention mechanisms based on a relaxed and computationally efficient formulation of the recently introduced E-Product, often referred to as the Yat-kernel (Bouhsine, 2025). The resulting interactions are geometry-aware and inspired by inverse-square interactions in physics. Our method, Spherical Linearized Attention with Yat Kernels (SLAY), constrains queries and keys to the unit sphere so that attention depends only on angular alignment. Using Bernstein's theorem, we express the spherical Yat-kernel as a nonnegative mixture of polynomial-exponential product kernels and derive a strictly positive random-feature approximation enabling linear-time O(L) attention. We establish positive definiteness and boundedness on the sphere and show that the estimator yields well-defined, nonnegative attention scores. Empirically, SLAY achieves performance that is nearly indistinguishable from standard softmax attention while retaining linear time and memory scaling, and consistently outperforms prior linear-time attention mechanisms such as Performers and Cosformers. To the best of our knowledge, SLAY represents the closest linear-time approximation to softmax attention reported to date, enabling scalable Transformers without the typical performance trade-offs of attention linearization.

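Motivation and Goals

Propose a linear-time attention mechanism that preserves the geometric properties of the Yat-kernel (E-Product) in long-context modeling.
Constrain queries and keys to unit norm so that alignment is decoupled from distance, which makes linearization tractable.
Derive a positive random-feature approximation via Bernstein's theorem to obtain O(L) attention.
Provide theoretical guarantees (positive definiteness, boundedness) together with practical scalability.
Compare SLAY empirically against softmax and prior linear-time methods on language and vision tasks, including Transformer-level evaluations.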
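
To make the decoupling concrete, here is a short worked derivation. It assumes the Yat-kernel has the E-Product form (q·k)² / (‖q − k‖² + ε); that formula and the symbols x, s, and ε are not given in this review and are assumptions for illustration only.

```latex
\begin{aligned}
\|q - k\|^2 &= \|q\|^2 + \|k\|^2 - 2\,q^\top k = 2 - 2x,
  \qquad x = q^\top k \in [-1, 1], \quad \|q\| = \|k\| = 1, \\
K_{\mathrm{Yat}}(q, k) &= \frac{x^2}{2 - 2x + \varepsilon}
  = \int_0^\infty e^{-s(2 + \varepsilon)} \, x^2 \, e^{2 s x} \, ds .
\end{aligned}
```

The integral converges because 2 + ε − 2x ≥ ε > 0 on the sphere. Under this assumption the kernel depends on q and k only through the angular alignment x, and each integrand is a polynomial kernel x² times an exponential kernel e^{2sx} with a nonnegative weight, which is consistent with the "nonnegative mixture of polynomial-exponential product kernels" described in the abstract.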

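Proposed Method

Reformulate the Yat-kernel as a geometry-aware similarity on the sphere for unit-norm queries and keys.
Linearize the denominator via Bernstein's theorem, using its Laplace representation, to obtain a nonnegative mixture of polynomial-exponential kernels.
Approximate the resulting kernel with strictly positive random features: the polynomial part through anchor features (among other options), the exponential part through positive random features (PRF).
Discretize the integral with Gauss-Laguerre quadrature, yielding a finite sum of kernels.
Fuse the polynomial and exponential random features through a randomized tensor sketch to form an implementable linear-time attention feature map.
Compute attention with the standard linear-attention contraction over the proposed feature map, without ever forming the L×L attention matrix.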
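
The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the spherical Yat-kernel is assumed to be x² / (2 − 2x + ε) with x = q·k, the value ε = 0.1 is arbitrary, the exact degree-2 feature map vec(qqᵀ) stands in for the paper's anchor features, an explicit tensor product stands in for the randomized tensor sketch, and the attention is non-causal. All function names are hypothetical.

```python
import numpy as np

def yat_sphere(x, eps=0.1):
    """Assumed spherical Yat-kernel x^2 / (2 - 2x + eps), where x = q . k."""
    return x**2 / (2.0 - 2.0 * x + eps)

def laguerre_nodes(n_nodes, eps=0.1):
    """Discretize 1/(C - 2x) = int_0^inf exp(-s(C - 2x)) ds with C = 2 + eps.

    Substituting t = s*C gives Gauss-Laguerre form, so s_r = t_r/C, a_r = w_r/C.
    """
    C = 2.0 + eps
    t, w = np.polynomial.laguerre.laggauss(n_nodes)
    return t / C, w / C

def yat_quadrature(x, n_nodes=32, eps=0.1):
    """Finite-sum approximation: x^2 * sum_r a_r * exp(2 * s_r * x)."""
    s, a = laguerre_nodes(n_nodes, eps)
    x = np.asarray(x, dtype=float)
    return x**2 * np.sum(a * np.exp(2.0 * s * x[..., None]), axis=-1)

def feature_map(X, omega, s, a):
    """Features whose inner product estimates sum_r a_r (q.k)^2 exp(2 s_r q.k).

    X: (L, d) with unit-norm rows; omega: (m, d) Gaussian projections.
    """
    L, d = X.shape
    m = omega.shape[0]
    # Exact (not random) features for the polynomial part (q . k)^2.
    poly = np.einsum("li,lj->lij", X, X).reshape(L, d * d)
    # Positive random features for exp(2 s_r q.k); the "- s" term uses |q| = 1.
    proj = X @ omega.T                                        # (L, m)
    prf = np.exp(np.sqrt(2.0 * s)[:, None, None] * proj[None]
                 - s[:, None, None]) / np.sqrt(m)             # (R, L, m)
    prf = prf * np.sqrt(a)[:, None, None]
    # Explicit tensor product of the two parts (a sketch would compress this).
    return np.einsum("lp,rlm->lrpm", poly, prf).reshape(L, -1)

def linear_attention(Q, K, V, omega, s, a, delta=1e-9):
    """O(L) contraction: never forms the L x L score matrix."""
    phi_q = feature_map(Q, omega, s, a)
    phi_k = feature_map(K, omega, s, a)
    kv = phi_k.T @ V                    # (D, d_v) summary of keys and values
    z = phi_k.sum(axis=0)               # (D,) normalizer summary
    return (phi_q @ kv) / (phi_q @ z + delta)[:, None]

def quadratic_attention(Q, K, V, omega, s, a, delta=1e-9):
    """Reference that does form the L x L scores, for checking only."""
    scores = feature_map(Q, omega, s, a) @ feature_map(K, omega, s, a).T
    return scores @ V / (scores.sum(axis=1, keepdims=True) + delta)

def unit_rows(X):
    """Project each row onto the unit sphere."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

# Small usage example with arbitrary sizes.
rng = np.random.default_rng(0)
Q = unit_rows(rng.standard_normal((16, 8)))
K = unit_rows(rng.standard_normal((16, 8)))
V = rng.standard_normal((16, 4))
omega = rng.standard_normal((32, 8))
s, a = laguerre_nodes(4)
out = linear_attention(Q, K, V, omega, s, a)
print(out.shape)
```

In this sketch the scores are nonnegative because the polynomial part is an exact square and the exponential features are positive. The explicit tensor product makes the feature dimension grow as R·d²·m, which is only practical for small d; a randomized sketch, as the method describes, is one way to avoid that growth. The quadrature converges more slowly as x approaches 1 with small ε, so the number of nodes is a tuning choice rather than a fixed value.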

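Experimental Results

Research Questions

RQ1 Does constraining queries and keys to the unit sphere preserve the Yat-kernel's geometry-aware properties while enabling a linear-time formulation?
RQ2 Can Bernstein's theorem yield a positive, tractable random-feature representation of the spherical Yat-kernel that supports O(L) attention?
RQ3 Do SLAY-based Transformers achieve near-softmax performance on language and vision tasks while retaining linear time and memory scaling?
RQ4 How does SLAY compare with Performers, Cosformers, and other linear-time attention methods in accuracy and scalability?
RQ5 Is SLAY effective for extreme (large-label) classification and full-scale Transformer training?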

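Key Findings

SLAY matches full spherical YAT attention and generally outperforms prior linear-time methods on the core benchmarks.
Under a matched feature budget, anchor features deliver high accuracy at markedly lower latency than the alternatives.
SLAY exhibits linear-time attention with lower memory use than exact methods and sustains throughput on very long sequences.
On extreme classification (Eurlex-4K), SLAY outperforms the Performer/FAVOR+ baseline on P@1, P@3, P@5, and PSP@1/3/5.
In the SLAYformer experiments, SLAY attention reaches validation loss and perplexity close to standard softmax and substantially better than the other linear-time attention baselines.
SLAY trains stably and scales well, approaching softmax quality at longer contexts while keeping O(L) complexity.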

This review was generated by AI and reviewed by a human editor.