QUICK REVIEW

[Paper Review] HyperAttention: Long-context Attention in Near-Linear Time

In‐Su Han, Rajesh Jayaram|arXiv (Cornell University)|Oct 9, 2023

Advanced Neural Network Applications|Cited by 9

One-line summary

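HyperAttention introduces a near-linear-time approximate attention method that combines a perturbation-robust approximation of the diagonal matrix D with row-norm-based sampling built on approximate matrix multiplication (AMM). It supports causal masking and delivers practical speedups on long contexts.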

ABSTRACT

We present an approximate attention mechanism named HyperAttention to address the computational challenges posed by the growing complexity of long contexts used in Large Language Models (LLMs). Recent work suggests that in the worst-case scenario, quadratic time is necessary unless the entries of the attention matrix are bounded or the matrix has low stable rank. We introduce two parameters which measure: (1) the max column norm in the normalized attention matrix, and (2) the ratio of row norms in the unnormalized attention matrix after detecting and removing large entries. We use these fine-grained parameters to capture the hardness of the problem. Despite previous lower bounds, we are able to achieve a linear time sampling algorithm even when the matrix has unbounded entries or a large stable rank, provided the above parameters are small. HyperAttention features a modular design that easily accommodates integration of other fast low-level implementations, particularly FlashAttention. Empirically, employing Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods, giving significant speed improvements compared to state-of-the-art solutions like FlashAttention. We validate the empirical performance of HyperAttention on a variety of different long-context length datasets. For example, HyperAttention makes the inference time of ChatGLM2 50% faster on 32k context length while perplexity increases from 5.6 to 6.3. On larger context length, e.g., 131k, with causal masking, HyperAttention offers 5-fold speedup on a single attention layer.

Motivation and objectives

Motivation: to overcome quadratic time and memory in attention for long sequences in Transformers.
Goal: design an approximate attention mechanism with spectral guarantees that works under long contexts and causal masking.
Aim: achieve near-linear time while maintaining performance on real LLMs and long-context benchmarks.

Proposed method

Define a sampling-based approximation for the attention output that preserves spectral properties.
Use Hamming sorted LSH to locate large entries in the attention matrix and form a sparse dominant-entry mask.
Approximate the diagonal scaling D with a near-linear time estimator that relies on uniform sampling of keys/rows.
Perform Approximate Matrix Multiplication by sampling rows of V according to squared row-norms to approximate D^{-1}AV.
Extend the approach to causal masking via a recursive block-partitioning scheme.
Provide the main theoretical guarantee: the HyperAttention output approximates Att to within an epsilon error in spectral norm, with near-linear runtime.
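
The two sampling steps in the list above (estimating D, then approximating the product with V) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: it omits the LSH heavy-entry mask, estimates D from a plain uniform sample of keys, and assumes logits small enough that exp does not overflow. The function name `sampled_attention` and the sample count `m` are hypothetical.

```python
import numpy as np

def sampled_attention(Q, K, V, m, rng):
    """Sketch of a sampling-based estimate of D^{-1} A V, with A = exp(Q K^T).

    Assumptions (not from the paper): no heavy-entry mask, a plain uniform
    sample for D, and logits small enough that exp() does not overflow.
    """
    n = K.shape[0]

    # Step 1: estimate the row sums D_ii = sum_j A_ij from m uniformly
    # sampled keys, rescaled by n / m.
    uni = rng.choice(n, size=m, replace=False)
    D_hat = np.exp(Q @ K[uni].T).sum(axis=1, keepdims=True) * (n / m)

    # Step 2: approximate matrix multiplication. Sample rows of V with
    # probability proportional to their squared norms, and reweight each
    # sampled column of A by 1 / (m * p_j) so the estimate of A V is unbiased.
    p = (V ** 2).sum(axis=1)
    p = p / p.sum()
    idx = rng.choice(n, size=m, replace=True, p=p)
    A_cols = np.exp(Q @ K[idx].T) / (m * p[idx])
    AV_hat = A_cols @ V[idx]

    # Step 3: apply the estimated normalization.
    return AV_hat / D_hat


rng = np.random.default_rng(0)
n, d, m = 512, 16, 128
Q = rng.normal(size=(n, d)) / np.sqrt(d)
K = rng.normal(size=(n, d)) / np.sqrt(d)
V = rng.normal(size=(n, d))
out = sampled_attention(Q, K, V, m, rng)  # shape (n, d)
```

Only n x m entries of the attention matrix are ever formed, so the cost of this sketch grows as n * m * d rather than n^2 * d. The error of such an estimate is controlled relative to the norms of the attention matrix and V, not relative to the output itself, so individual output entries can still be noisy for small m.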

Figure 2: Causal attention matrix can be divided into three equal-sized non-zero sections: ${\bm{M}}^{\mathcal{C}}_{1}\odot{\bm{A}}_{11}$ and ${\bm{M}}^{\mathcal{C}}_{2}\odot{\bm{A}}_{22}$ are both causal attention matrices, and ${\bm{A}}_{21}$ is an unmasked attention matrix.
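
The partition described in the Figure 2 caption can be sketched as a recursion in NumPy. This is an illustrative sketch, not Algorithm 4 itself: the unmasked block A_21 is computed exactly here as a stand-in for the place where HyperAttention would apply its approximation, and the names `unmasked_block`, `causal_parts`, `causal_attention`, and `min_size` are hypothetical.

```python
import numpy as np

def unmasked_block(Q, K, V):
    """Unnormalized attention for an unmasked block: returns (A V, A 1).

    Computed exactly here; an approximate routine could be swapped in.
    """
    A = np.exp(Q @ K.T)
    return A @ V, A.sum(axis=1, keepdims=True)

def causal_parts(Q, K, V, min_size=2):
    """Return the numerator (M o A) V and the row sums of M o A, where M is
    the causal (lower-triangular) mask, by splitting into three blocks."""
    n = Q.shape[0]
    if n <= min_size:
        # Base case: small enough to mask and multiply directly.
        A = np.tril(np.exp(Q @ K.T))
        return A @ V, A.sum(axis=1, keepdims=True)
    h = n // 2
    # Two causal sub-problems on the diagonal blocks A_11 and A_22 ...
    num1, den1 = causal_parts(Q[:h], K[:h], V[:h], min_size)
    num2, den2 = causal_parts(Q[h:], K[h:], V[h:], min_size)
    # ... and one unmasked block A_21 (later queries, earlier keys).
    num21, den21 = unmasked_block(Q[h:], K[:h], V[:h])
    num = np.vstack([num1, num2 + num21])
    den = np.vstack([den1, den2 + den21])
    return num, den

def causal_attention(Q, K, V, min_size=2):
    num, den = causal_parts(Q, K, V, min_size)
    return num / den


rng = np.random.default_rng(0)
Q = rng.normal(size=(16, 4)) * 0.3
K = rng.normal(size=(16, 4)) * 0.3
V = rng.normal(size=(16, 4))
out = causal_attention(Q, K, V)  # shape (16, 4)
```

Because the unmasked block is exact in this sketch, the result matches dense causal attention up to floating-point error. The numerator and the row sums are carried separately so that normalization happens once, after the contributions of all three blocks have been combined.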

Experimental results

Research questions

RQ1: Can we achieve near-linear time approximation for dot-product attention under long context lengths while preserving spectral properties?
RQ2: Can heavy-entry detection plus row-norm-based sampling provide practical and provable guarantees without requiring bounded entries or low stable rank?
RQ3: How does HyperAttention perform with causal masking and on pretrained LLMs for long-context benchmarks?

Key findings

Number of Replaced Layers	Single-QA	Multi-QA	Summarization	Few-shot	Synthetic	Code
0 (exact)	80.63	68.14	53.12	186.58	84.00	99.57
7	72.34	65.37	52.54	182.12	82.00	102.06
14	71.32	62.97	52.13	169.02	80.00	92.12
21	44.75	48.91	50.90	150.44	31.00	82.58
28	27.07	22.65	46.28	65.74	8.33	73.68
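
As a reading aid for the table above, the short script below computes, for each task, the fraction of the exact-attention score that is retained as more layers are replaced. The scores are copied from the table; the lowercase task keys and the helper name `retention` are hypothetical.

```python
# Scores copied from the table above; one list entry per task column.
tasks = ["single-qa", "multi-qa", "summarization", "few-shot", "synthetic", "code"]
scores = {
    0:  [80.63, 68.14, 53.12, 186.58, 84.00, 99.57],
    7:  [72.34, 65.37, 52.54, 182.12, 82.00, 102.06],
    14: [71.32, 62.97, 52.13, 169.02, 80.00, 92.12],
    21: [44.75, 48.91, 50.90, 150.44, 31.00, 82.58],
    28: [27.07, 22.65, 46.28, 65.74, 8.33, 73.68],
}

def retention(replaced_layers):
    """Fraction of the exact-attention (0 layers) score kept for each task."""
    exact = scores[0]
    return {t: s / e for t, s, e in zip(tasks, scores[replaced_layers], exact)}

for layers in (7, 14, 21, 28):
    kept = retention(layers)
    print(layers, {t: round(v, 2) for t, v in kept.items()})
```

With all 28 layers replaced, summarization keeps roughly 87% of its exact score and code roughly 74%, while single-QA keeps roughly 34%.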

HyperAttention yields near-linear time complexity for attention approximation under mild assumptions.
Empirically, it yields substantial speedups (e.g., over 50x faster in forward/backward for 131k context without masking; ~5x with causal masking) while maintaining competitive perplexity on long-context tasks.
When integrated into pretrained models (e.g., ChatGLM2-6B-32k, Phi-1.5) via monkey patches, it provides significant speedups with modest performance degradation on some tasks, notably preserving quality in summarization and code tasks.
The method supports causal masking via a recursive partitioning scheme (Algorithm 4) and remains effective across long-context datasets (LongBench).
An empirical assumption on the column norms of the normalized (softmax) attention matrix (the alpha bound) holds in practice, supporting the method’s feasibility.
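
To make the alpha parameter concrete, the sketch below measures it on a given pair of query and key matrices. It follows the abstract's description of parameter (1), the max column norm of the normalized attention matrix. Using the squared norm and scaling by n are assumptions of this sketch, and the function name `alpha_parameter` is hypothetical. It forms the full n x n matrix, so it is a diagnostic for small inputs only.

```python
import numpy as np

def alpha_parameter(Q, K):
    """n times the largest squared column norm of the softmax matrix D^{-1} A.

    The squared norm and the n-scaling are assumptions of this sketch.
    """
    logits = Q @ K.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(logits)
    P = A / A.sum(axis=1, keepdims=True)  # each row sums to 1
    n = P.shape[0]
    return n * (P ** 2).sum(axis=0).max()


rng = np.random.default_rng(0)
Q = rng.normal(size=(64, 8)) / np.sqrt(8)
K = rng.normal(size=(64, 8)) / np.sqrt(8)
alpha = alpha_parameter(Q, K)
```

Under this definition alpha equals 1 when every query attends uniformly to all keys and reaches n^2 when every query puts all of its weight on a single shared key, so small values indicate that no key dominates the attention of most queries.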

(a) chatglm2-6b-32k


This review was generated by AI and checked by a human editor.