QUICK REVIEW

[Paper Review] Sparse Attention as Compact Kernel Regression

Saul Santos, Nuno Gonçalves|arXiv (Cornell University)|Jan 30, 2026

Domain Adaptation and Few-Shot Learning | Citations: 0

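One-Line Summary

This paper formalizes the kernel regression view of sparse attention, shows that sparsemax and alpha-entmax arise from Epanechnikov-type compact kernels, and reports that kernel-based Memory Mosaics achieve competitive performance on language modeling, in-context learning, and length generalization.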

ABSTRACT

Recent work has revealed a link between self-attention mechanisms in transformers and test-time kernel regression via the Nadaraya-Watson estimator, with standard softmax attention corresponding to a Gaussian kernel. However, a kernel-theoretic understanding of sparse attention mechanisms is currently missing. In this paper, we establish a formal correspondence between sparse attention and compact (bounded support) kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, we demonstrate that widely used kernels in nonparametric density estimation -- including Epanechnikov, biweight, and triweight -- correspond to $α$-entmax attention with $α= 1 + \frac{1}{n}$ for $n \in \mathbb{N}$, while the softmax/Gaussian relationship emerges in the limit $n \to \infty$. This unified perspective explains how sparsity naturally emerges from kernel design and provides principled alternatives to heuristic top-$k$ attention and other associative memory mechanisms. Experiments with a kernel-regression-based variant of transformers -- Memory Mosaics -- show that kernel-based sparse attention achieves competitive performance on language modeling, in-context learning, and length generalization tasks, offering a principled framework for designing attention mechanisms.
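As a quick worked reading of the abstract's formula $α = 1 + \frac{1}{n}$, the block below lists the first few cases. The kernel form is the one quoted later in this review (where the exponent is written r). The bandwidth scaling $h^2 = 2 n \sigma^2$ used in the limit is an assumption chosen here to make the standard limit $(1 - x/n)^n \to e^{-x}$ explicit. It is not necessarily the parameterization used in the paper.

```latex
\begin{align*}
K_h(u) &\propto \left[1 - \frac{\|u\|^2}{h^2}\right]_+^{n},
  \qquad \alpha = 1 + \frac{1}{n} \\
n = 1 &:\ \alpha = 2   \quad \text{(Epanechnikov kernel; 2-entmax is sparsemax)} \\
n = 2 &:\ \alpha = 3/2 \quad \text{(biweight kernel)} \\
n = 3 &:\ \alpha = 4/3 \quad \text{(triweight kernel)} \\
n \to \infty &:\ \alpha \to 1, \quad
  \left[1 - \frac{\|u\|^2}{2 n \sigma^2}\right]_+^{n}
  \to \exp\!\left(-\frac{\|u\|^2}{2\sigma^2}\right)
  \quad \text{(Gaussian kernel; softmax)}
\end{align*}
```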

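Motivation and Objectives

Position sparse attention as a principled alternative to dense softmax, motivated by nonparametric kernel regression.
Characterize how compact-support kernels induce sparsity and locality in attention.
Develop and analyze kernel-based attention variants within Memory Mosaics.
Connect top-k, fixed-normalization, and adaptive sparse attention mechanisms in a unified framework.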

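Proposed Method

Reviews the Nadaraya-Watson kernel regression view of attention and relates softmax to the Gaussian kernel.
Maps normalized ReLU, sparsemax, and alpha-entmax to Epanechnikov and related compact kernels, with a fixed bandwidth for normalized ReLU and adaptive bandwidths otherwise.
Shows that alpha-entmax with alpha = 1 + 1/r corresponds to the kernel K_h(u) ∝ [1 - ||u||^2 / h^2]_+^r, where r = 1/(alpha - 1) and [x]_+ = max(x, 0).
Introduces anchored compact kernels such as top-k uniform, top-k softmax, and ReLUmax, and links them to kNN and margin-based sparsity.
Presents Memory Mosaics as a kernel-regression-based transformer variant and explains how keys and values are formed and used in the Nadaraya-Watson update.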
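The kernel regression view listed above can be illustrated with a minimal NumPy sketch. It is not the paper's code: the helper names (`nw_weights`, `gaussian_kernel`, `compact_kernel`), the toy 2-D keys, and the bandwidth values are all hypothetical. It assumes unit-norm queries and keys, in which case Gaussian-kernel Nadaraya-Watson weights reduce exactly to softmax over dot products scaled by 1/h^2, while a compact kernel assigns exactly zero weight to keys outside its support.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def nw_weights(q, K, kernel):
    """Nadaraya-Watson weights of each key (row of K) for query q."""
    d2 = ((K - q) ** 2).sum(axis=1)   # squared distances ||q - k_i||^2
    w = kernel(d2)
    return w / w.sum()                # assumes at least one key lies in the support

def gaussian_kernel(h):
    return lambda d2: np.exp(-d2 / (2.0 * h**2))

def compact_kernel(h, r):
    # [1 - ||u||^2 / h^2]_+^r : r=1 Epanechnikov, r=2 biweight, r=3 triweight
    return lambda d2: np.maximum(1.0 - d2 / h**2, 0.0) ** r

# Toy unit-norm keys on the circle (hypothetical data) and a unit-norm query.
angles = np.array([0.2, 0.9, 1.8, 2.8, -0.5, -1.4])
K = np.stack([np.cos(angles), np.sin(angles)], axis=1)
q = np.array([1.0, 0.0])
V = np.arange(6, dtype=float).reshape(6, 1)   # toy values, one per key

h = 0.7
w_gauss = nw_weights(q, K, gaussian_kernel(h))
w_soft = softmax(K @ q / h**2)                # softmax attention, temperature h^2

w_epan = nw_weights(q, K, compact_kernel(1.0, 1))   # Epanechnikov, sparse
w_biw = nw_weights(q, K, compact_kernel(1.0, 2))    # biweight, same support
y_epan = w_epan @ V                                 # kernel regression output
```

In this toy setup `w_gauss` and `w_soft` agree up to floating-point error, and the compact-kernel weights are zero for every key farther than h from the query. A fixed bandwidth can leave no key in the support, which makes the normalizer zero; the sketch does not guard against that case.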

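Experimental Results

Research Questions

RQ1: How do compact (bounded-support) kernels relate to sparse attention mechanisms such as sparsemax and alpha-entmax?
RQ2: Can existing sparse attention methods (top-k, fixed normalization) be characterized within a unified kernel regression framework?
RQ3: Are kernel-based sparse attention variants in Memory Mosaics competitive on language modeling, in-context learning, and length generalization tasks?
RQ4: How do new attention transformations (e.g., ReLUmax) arise from kernel design, and how do they behave in practice?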

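Key Findings

Sparsemax attention corresponds to self-normalized Epanechnikov kernel regression with an adaptive bandwidth.
alpha-entmax attention with alpha > 1 corresponds to compact-support polynomial kernels with exponent r = 1/(alpha - 1), which include the Epanechnikov, biweight, and triweight kernels.
Top-k and fixed-normalization sparse attention fit into the same kernel regression framework: top-k softmax links to kNN regression, and normalized ReLU links to fixed-bandwidth Epanechnikov regression.
The new ReLUmax transformation anchors the kernel support near the maximum similarity, which avoids a degenerate zero denominator.
In the Memory Mosaics experiments, kernel-based sparse attention is competitive on language modeling, in-context learning, and length generalization, and adaptive sparse kernels often outperform fixed-sparsity baselines.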
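The first finding can be checked numerically in a simplified setting. The sketch below is an illustration under stated assumptions, not the paper's derivation. It assumes unit-norm queries and keys, so that ||q - k||^2 = 2 - 2 q·k, and it uses the standard sort-based sparsemax algorithm. Under that assumption, the sparsemax threshold tau corresponds to an Epanechnikov bandwidth h^2 = 2(1 - tau). The toy keys and variable names are hypothetical.

```python
import numpy as np

def sparsemax(z):
    """Sort-based sparsemax: returns probabilities [z - tau]_+ and the threshold tau."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, len(z) + 1)
    in_support = 1.0 + ks * z_sorted > cumsum   # true for the k largest scores
    k = ks[in_support][-1]                      # support size
    tau = (cumsum[k - 1] - 1.0) / k             # threshold so the output sums to 1
    return np.maximum(z - tau, 0.0), tau

# Toy unit-norm keys and query (hypothetical data).
angles = np.array([0.2, 0.9, 1.8, 2.8, -0.5, -1.4])
K = np.stack([np.cos(angles), np.sin(angles)], axis=1)
q = np.array([1.0, 0.0])

scores = K @ q                     # dot-product similarities
p, tau = sparsemax(scores)

# Epanechnikov kernel regression weights with the data-dependent bandwidth
# implied by tau (valid here because queries and keys have unit norm).
h2 = 2.0 * (1.0 - tau)
d2 = ((K - q) ** 2).sum(axis=1)
w = np.maximum(1.0 - d2 / h2, 0.0)
w_epan = w / w.sum()
```

Here `p` and `w_epan` coincide, and keys whose similarity falls below tau receive exactly zero weight. The bandwidth depends on the scores themselves, which is what "adaptive" means in this sketch. A normalized ReLU with a fixed threshold would instead keep h constant across queries.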

This review was generated by AI and checked by a human editor.