QUICK REVIEW

[Paper Review] Softmax Linear Attention: Reclaiming Global Competition

Mingwei Xu, Xuan Lin | arXiv (Cornell University) | 2026. 02. 02.

Topic Modeling | Citations: 0

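One-line summary

Softmax Linear Attention (SLA) reintroduces head-level softmax competition into linear attention, achieving winner-take-all style selectivity across semantic heads while keeping linear time and memory. Across several linear baselines, it improves retrieval reliability and robustness on long-context tasks.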

ABSTRACT

While linear attention reduces the quadratic complexity of standard Transformers to linear time, it often lags behind in expressivity due to the removal of softmax normalization. This omission eliminates global competition, a critical mechanism that enables models to sharply focus on relevant information amidst long-context noise. In this work, we propose Softmax Linear Attention (SLA), a framework designed to restore this competitive selection without sacrificing efficiency. By lifting the softmax operation from the token level to the head level, SLA leverages attention heads as coarse semantic slots, applying a competitive gating mechanism to dynamically select the most relevant subspaces. This reintroduces the "winner-take-all" dynamics essential for precise retrieval and robust long-context understanding. Distinct from prior methods that focus on refining local kernel functions, SLA adopts a broader perspective by exploiting the higher-level multi-head aggregation structure. Extensive experiments demonstrate that SLA consistently enhances state-of-the-art linear baselines (RetNet, GLA, GDN) across language modeling and long-context benchmarks, particularly in challenging retrieval scenarios where it significantly boosts robustness against noise, validating its capability to restore precise focus while maintaining linear complexity.

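Research motivation and objectives

Identifies the expressivity gap (magnitude accumulation, context collapse) that arises when softmax is removed from linear attention.
Proposes SLA to maintain inter-head competition while preserving linear complexity.
Theoretically analyzes the recovery of magnitude sensitivity and the asymptotic winner-take-all dynamics.
Demonstrates SLA's effectiveness on language modeling and long-context tasks by applying it to state-of-the-art linear baselines (RetNet, GLA, GDN).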
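
Method

Adds head-level softmax gates to Q and K, redefining multi-head aggregation and creating competition among heads.
The SLA output is expressed as O_SLA = Concat_h ((G^Q_h ⊙ φ(Q_h)) (G^K_h ⊙ φ(K_h))^T V_h) W^O.
Computes G^Q_h = softmax(Q W_GQ)_h and G^K_h = softmax(K W_GK)_h using low-rank head projections W_GQ and W_GK.
Provides recurrent and chunk-wise training implementations to maintain linear complexity.
Adds only lightweight parameters (two projection matrices per layer), avoiding excessive overhead.
Presents theoretical results showing the recovery of magnitude sensitivity and asymptotic winner-take-all dynamics.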
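
The output formula and the recurrent implementation listed above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature map φ is assumed to be elu(x) + 1, causal masking is assumed (the formula as written does not show it), the gate projections W_gq and W_gk are assumed to map the model dimension directly to H head logits, and all function names are hypothetical. The quadratic version exists only to check that the linear-time recurrence produces the same output.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def phi(x):
    """Assumed feature map: elu(x) + 1, which keeps features positive."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def gated_features(Q, K, V, W_gq, W_gk, n_heads):
    """Split into heads and apply the head-level softmax gates."""
    T, d = Q.shape
    dh = d // n_heads
    G_q = softmax(Q @ W_gq)  # (T, H): each row sums to 1 across heads
    G_k = softmax(K @ W_gk)  # (T, H)
    q = phi(Q.reshape(T, n_heads, dh)) * G_q[:, :, None]
    k = phi(K.reshape(T, n_heads, dh)) * G_k[:, :, None]
    v = V.reshape(T, n_heads, dh)
    return q, k, v

def sla_recurrent(Q, K, V, W_gq, W_gk, W_o, n_heads):
    """Causal SLA in recurrent form: O(T) time, fixed-size state per head."""
    T, d = Q.shape
    dh = d // n_heads
    q, k, v = gated_features(Q, K, V, W_gq, W_gk, n_heads)
    S = np.zeros((n_heads, dh, dh))  # running sum of k_t^T v_t per head
    out = np.empty((T, n_heads, dh))
    for t in range(T):
        S += np.einsum("hi,hj->hij", k[t], v[t])
        out[t] = np.einsum("hi,hij->hj", q[t], S)
    return out.reshape(T, d) @ W_o  # concatenate heads, then project

def sla_quadratic(Q, K, V, W_gq, W_gk, W_o, n_heads):
    """Reference O(T^2) form with a causal mask, used only for checking."""
    T, d = Q.shape
    q, k, v = gated_features(Q, K, V, W_gq, W_gk, n_heads)
    scores = np.einsum("thi,shi->hts", q, k) * np.tril(np.ones((T, T)))
    out = np.einsum("hts,shj->thj", scores, v)
    return out.reshape(T, d) @ W_o

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
T, d, H = 6, 8, 4
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
W_gq, W_gk = rng.normal(size=(d, H)), rng.normal(size=(d, H))
W_o = rng.normal(size=(d, d))

fast = sla_recurrent(Q, K, V, W_gq, W_gk, W_o, H)
slow = sla_quadratic(Q, K, V, W_gq, W_gk, W_o, H)
print(np.allclose(fast, slow))
```

Because the gates multiply the feature maps before the key-value outer product, they fold into the recurrent state and add no extra pass over the sequence. The only new parameters in this sketch are the two gate projection matrices.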
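
Research questions

Can head-level softmax competition restore the global selectivity lost in linear attention?
Does SLA provide fast (attention) allocation while maintaining linear time/space complexity?
Do SLA-equipped linear baselines (RetNet, GLA, GDN) achieve better retrieval and long-context performance?
Is there theoretical support for magnitude sensitivity and winner-take-all dynamics in SLA?
How does SLA affect training/inference efficiency and scalability across model sizes?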
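
Key findings

SLA restores magnitude sensitivity by making the head gates respond to the magnitude of queries and keys, enabling confident, sharp focusing.
As model confidence increases, SLA's head gates tend to concentrate on a single head, approximating a one-hot competition.
Experimental results show that SLA improves retrieval accuracy over the baseline linear models on real-world tasks (e.g., Softmax-GLA, Softmax-RetNet, Softmax-GDN).
Across multiple long-context benchmarks, SLA consistently improves the linear baselines and narrows the gap to the full softmax Transformer.
Ablation studies show that more heads (H) amplify SLA's benefits, validating the semantic-slot competition hypothesis.
Training and inference incur only light overhead; throughput is maintained and memory usage remains scalable.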
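
The first two findings rest on a basic property of softmax: scaling the logits up sharpens the distribution toward one-hot. The sketch below illustrates this with made-up head logits and a hypothetical `head_gate` helper. It treats the query or key magnitude as a plain scalar multiplier on the logits, which is a simplification of the gate described in the review.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def head_gate(logits, scale):
    """Gate over heads when the logits are multiplied by `scale`."""
    return softmax([scale * x for x in logits])

# Hypothetical per-head logits; only their relative order matters here.
logits = [1.0, 0.5, 0.2, -0.3]

for scale in (1, 4, 16, 64):
    gate = head_gate(logits, scale)
    print(scale, [round(g, 3) for g in gate])
```

As the scale grows, the largest gate value approaches 1 and the others approach 0, which is the winner-take-all behavior the findings describe.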

Model            SWDE    SQuAD   FDA     Avg.
Transformer++    52.21   30.90   65.43   49.51
RetNet           19.71   27.28   12.89   19.96
Softmax RetNet   30.51   32.74    9.98   24.41
GLA              22.41   25.84    9.26   19.17
Softmax GLA      33.48   31.27   15.88   26.88
GDN              41.40   34.05   29.13   34.86
Softmax GDN      41.80   34.76   28.96   35.17
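
The Avg. column and the per-baseline gains can be recomputed from the three task scores in the table. The short script below does that arithmetic; the dictionary layout and helper names are illustrative choices, while the numbers are copied from the table.

```python
# Scores copied from the table, in the order (SWDE, SQuAD, FDA).
TASKS = ("SWDE", "SQuAD", "FDA")
scores = {
    "Transformer++": (52.21, 30.90, 65.43),
    "RetNet": (19.71, 27.28, 12.89),
    "Softmax RetNet": (30.51, 32.74, 9.98),
    "GLA": (22.41, 25.84, 9.26),
    "Softmax GLA": (33.48, 31.27, 15.88),
    "GDN": (41.40, 34.05, 29.13),
    "Softmax GDN": (41.80, 34.76, 28.96),
}

def avg(row):
    """Unweighted mean of the task scores."""
    return sum(row) / len(row)

def per_task_delta(base):
    """Score change per task when the baseline is equipped with SLA."""
    sla = scores["Softmax " + base]
    return {t: s - b for t, s, b in zip(TASKS, sla, scores[base])}

for base in ("RetNet", "GLA", "GDN"):
    gain = avg(scores["Softmax " + base]) - avg(scores[base])
    deltas = {t: round(v, 2) for t, v in per_task_delta(base).items()}
    print(f"{base}: avg gain {gain:+.2f}, per task {deltas}")
```

The averages rise for all three baselines, with GLA gaining the most and GDN the least. The per-task view also shows that the FDA score drops for RetNet and GDN, so the improvement in the average is not uniform across tasks.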

This review was generated by AI and reviewed by a human editor.