QUICK REVIEW

[Paper Review] Sparse Attention as Compact Kernel Regression

Saul Santos, Nuno Gonçalves | arXiv (Cornell University) | 2026-01-30

Domain Adaptation and Few-Shot Learning | Citations: 0

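One-line summary

This paper formalizes a kernel-regression view of sparse attention, showing that sparsemax and alpha-entmax arise from Epanechnikov-like compact kernels, and demonstrates that kernel-based Memory Mosaics achieve competitive performance on language modeling, in-context learning, and length generalization.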

ABSTRACT

Recent work has revealed a link between self-attention mechanisms in transformers and test-time kernel regression via the Nadaraya-Watson estimator, with standard softmax attention corresponding to a Gaussian kernel. However, a kernel-theoretic understanding of sparse attention mechanisms is currently missing. In this paper, we establish a formal correspondence between sparse attention and compact (bounded support) kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, we demonstrate that widely used kernels in nonparametric density estimation -- including Epanechnikov, biweight, and triweight -- correspond to $\alpha$-entmax attention with $\alpha = 1 + \frac{1}{n}$ for $n \in \mathbb{N}$, while the softmax/Gaussian relationship emerges in the limit $n \to \infty$. This unified perspective explains how sparsity naturally emerges from kernel design and provides principled alternatives to heuristic top-$k$ attention and other associative memory mechanisms. Experiments with a kernel-regression-based variant of transformers -- Memory Mosaics -- show that kernel-based sparse attention achieves competitive performance on language modeling, in-context learning, and length generalization tasks, offering a principled framework for designing attention mechanisms.
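
For reference, the kernel family named in the abstract can be written out explicitly. The sketch below uses the standard unnormalized textbook forms, with $u$ the bandwidth-scaled distance and $(\cdot)_+$ the positive part. The $u^2/(2n)$ scaling in the last line is an assumption chosen here to make the Gaussian limit visible; the paper's own parameterization may differ.

```latex
\begin{align*}
n = 1 \;(\alpha = 2):\quad & K(u) \propto (1 - u^2)_+ && \text{Epanechnikov (sparsemax)} \\
n = 2 \;(\alpha = 3/2):\quad & K(u) \propto (1 - u^2)_+^{2} && \text{biweight} \\
n = 3 \;(\alpha = 4/3):\quad & K(u) \propto (1 - u^2)_+^{3} && \text{triweight} \\
n \to \infty \;(\alpha \to 1):\quad & \left(1 - \tfrac{u^2}{2n}\right)_+^{n} \to e^{-u^2/2} && \text{Gaussian (softmax)}
\end{align*}
```

The last line is the elementary limit $(1 - x/n)^n \to e^{-x}$ with $x = u^2/2$: as the exponent grows and the support widens, the compact kernel loses its hard cutoff and the attention weights stop being exactly zero.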

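Research motivation and goals

Grounds sparse attention in nonparametric kernel regression, so that it can be presented as a principled alternative to dense softmax.
Characterizes how compact-support kernels induce sparsity and locality in attention.
Develops and analyzes kernel-based attention variants within Memory Mosaics.
Provides a unified framework connecting top-k, fixed-normalization, and adaptive sparse attention mechanisms.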

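Proposed method

Examines attention through the lens of Nadaraya-Watson kernel regression and links softmax to the Gaussian kernel.
Maps normalized ReLU, sparsemax, and alpha-entmax to the Epanechnikov kernel and related compact kernels, with a fixed bandwidth for normalized ReLU and an adaptive bandwidth for sparsemax.
Shows that alpha-entmax with alpha = 1 + 1/r can be interpreted as a kernel of the form K_h(u) ∝ [1 - ||u||^2 / h^2]_+^r, where [·]_+ denotes the positive part and r = 1/(alpha - 1).
Introduces anchored compact kernels such as top-k_uniform, top-k_softmax, and ReLUmax, and connects them to kNN and margin-based sparsity.
Presents Memory Mosaics as a kernel-regression-based transformer variant and explains how keys and values are formed and used in the Nadaraya-Watson update.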
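
The first two correspondences in the list above can be checked numerically. The following is a minimal sketch, not the paper's implementation: it assumes unit-norm queries and keys (so that ||q - k||^2 = 2 - 2 q·k) and a bandwidth h = 1, and the function names are illustrative. Under those assumptions the Gaussian kernel yields softmax weights, and the Epanechnikov kernel yields normalized ReLU weights with the fixed threshold 1 - h^2/2 = 0.5.

```python
import numpy as np

def nadaraya_watson(query, keys, values, kernel):
    """Kernel regression: average of values weighted by kernel(query - key)."""
    weights = kernel(query[None, :] - keys)        # shape (num_keys,)
    total = weights.sum()
    if total == 0.0:                               # no key inside the kernel support
        return np.zeros(values.shape[1]), weights
    return weights @ values / total, weights / total

def gaussian_kernel(diff, h=1.0):
    # Full support: every key receives a strictly positive weight.
    return np.exp(-np.sum(diff**2, axis=-1) / (2.0 * h**2))

def epanechnikov_kernel(diff, h=1.0):
    # Compact support: exactly zero once ||diff|| >= h.
    return np.maximum(0.0, 1.0 - np.sum(diff**2, axis=-1) / h**2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
keys = rng.normal(size=(6, 4))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)   # assumption: unit-norm keys
values = rng.normal(size=(6, 3))
query = keys[0].copy()                                 # unit-norm query inside the support
scores = keys @ query                                  # dot-product attention scores

_, w_gauss = nadaraya_watson(query, keys, values, gaussian_kernel)
_, w_epan = nadaraya_watson(query, keys, values, epanechnikov_kernel)

# Normalized ReLU with the fixed threshold implied by h = 1.
relu = np.maximum(0.0, scores - 0.5)
w_relu = relu / relu.sum()

print(np.allclose(w_gauss, softmax(scores)))   # Gaussian kernel <-> softmax
print(np.allclose(w_epan, w_relu))             # Epanechnikov <-> normalized ReLU
```

The zero-sum guard in `nadaraya_watson` matters only for compact kernels: a Gaussian kernel never produces an all-zero weight vector, while a fixed-bandwidth Epanechnikov kernel can when every key lies outside the support.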

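Experimental results

Research questions

RQ1: How do compact (bounded-support) kernels relate to sparse attention mechanisms such as sparsemax and alpha-entmax?
RQ2: Can existing sparse attention methods (top-k, fixed normalization) be characterized within a single unified kernel-regression framework?
RQ3: Do kernel-based sparse attention variants of Memory Mosaics deliver competitive performance on language modeling, in-context learning, and length generalization tasks?
RQ4: Which new attention transformations (for example, ReLUmax) emerge from kernel design, and how do they behave in practice?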

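Key results

Sparsemax attention corresponds to self-normalized Epanechnikov kernel regression with an adaptive bandwidth.
Alpha-entmax attention with alpha > 1 corresponds to compact-support polynomial kernels with exponent r = 1/(alpha - 1), a family that covers the Epanechnikov, biweight, and triweight kernels.
Top-k and fixed-normalization sparse attention fit into the same kernel-regression framework: top-k softmax is linked to kNN regression, and normalized ReLU to fixed-bandwidth Epanechnikov regression.
The new ReLUmax transformation anchors the kernel support near the maximum similarity and, notably, avoids division by zero.
Experiments with Memory Mosaics show that kernel-based sparse attention is competitive on language modeling, in-context learning, and length generalization, and that adaptive sparse kernels often outperform fixed-sparsity baselines.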
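
To make the contrast between adaptive and fixed sparsity concrete, here is a small sketch of sparsemax and top-k softmax. The sparsemax routine is the standard sort-based simplex projection rather than code from the paper, and the example scores and function names are illustrative. The threshold `tau` is chosen from the data so that the thresholded scores already sum to one, which mirrors the self-normalized, adaptive-bandwidth reading given above; top-k softmax instead fixes the number of nonzero weights in advance, as in kNN.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of the score vector z onto the probability simplex."""
    z_sorted = np.sort(z)[::-1]                    # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    in_support = 1 + k * z_sorted > cumsum         # true for a prefix of the sorted scores
    k_max = k[in_support][-1]                      # size of the support
    tau = (cumsum[k_max - 1] - 1.0) / k_max        # data-dependent threshold
    return np.maximum(0.0, z - tau), tau

def topk_softmax(z, k):
    """Softmax restricted to the k largest scores; all other weights are zero."""
    idx = np.argsort(z)[-k:]
    out = np.zeros_like(z)
    e = np.exp(z[idx] - z[idx].max())
    out[idx] = e / e.sum()
    return out

scores = np.array([2.0, 1.5, 0.1, -1.0])

p_sparse, tau = sparsemax(scores)
print(p_sparse, tau)        # [0.75 0.25 0.   0.  ] 1.25

p_topk = topk_softmax(scores, k=2)
print(p_topk)               # two nonzero weights that sum to 1
```

In this example both transforms keep two entries, but only sparsemax decides that number from the scores themselves: flatter scores enlarge its support and peaked scores shrink it, whereas top-k softmax always keeps exactly k.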

This review was generated by AI and reviewed by a human editor.