
[Paper Review] Generating Long Sequences with Sparse Transformers

Rewon Child, Scott Gray | arXiv (Cornell University) | Apr 23, 2019
Topic Modeling | References: 25 | Cited by: 489
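One-Sentence Summary

This paper proposes sparse factorizations of self-attention (Sparse Transformers) that scale Transformers to long sequences, setting a new state of the art for density modeling on text (Enwik8) and images (CIFAR-10, ImageNet-64), modeling raw audio with the same architecture, and handling contexts tens of thousands of timesteps long with networks hundreds of layers deep.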

ABSTRACT

Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64. We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention to model sequences of length one million or more.
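
To make the $O(n \sqrt{n})$ claim concrete, here is a rough back-of-the-envelope count in Python. It assumes two attention heads that each look at about $\sqrt{n}$ positions per query; the function names and the sequence length of 16384 are illustrative choices, not values taken from the paper's experiments.

```python
def dense_causal_pairs(n: int) -> int:
    """Query-key pairs evaluated by dense causal attention: 1 + 2 + ... + n."""
    return n * (n + 1) // 2


def factorized_pairs_estimate(n: int, heads: int = 2) -> int:
    """Rough estimate, ASSUMING each of `heads` heads attends to about
    n ** (1 / heads) positions per query (so about 2 * sqrt(n) for two heads)."""
    per_query = heads * round(n ** (1.0 / heads))
    return n * per_query


n = 16384  # illustrative sequence length (hypothetical, chosen as a perfect square)
dense = dense_causal_pairs(n)
sparse = factorized_pairs_estimate(n)
print(f"dense:  {dense:,} query-key pairs")
print(f"sparse: {sparse:,} query-key pairs")
print(f"ratio:  {dense / sparse:.1f}x fewer pairs under the assumption above")
```

The estimate ignores constant factors and kernel efficiency, so it only indicates how the gap widens as the sequence grows.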

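Motivation and Objectives

  • Enable scalable autoregressive modeling of long sequences across text, images, and audio.
  • Reduce the time and memory cost of attention from quadratic, $O(n^2)$, to $O(n \sqrt{n})$ through sparse factorizations of the attention matrix.
  • Make very deep Transformer-style models trainable through changes to architecture and initialization.
  • Demonstrate state-of-the-art density modeling across multiple data modalities.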

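Proposed Method

  • Introduce factorized self-attention, in which each position attends to only a sparse subset of previous positions.
  • Explore two-dimensional factorized attention patterns, strided and fixed, with controllable locality and coverage.
  • Build Sparse Transformer blocks from pre-activation residual connections and layer normalization to support deep networks.
  • Recompute the attention and feed-forward blocks during the backward pass to save memory.
  • Implement efficient GPU kernels for sparse attention that combine local windows with block-wise computation.
  • Train with mixed precision, Adam with learning-rate warmup, cosine learning-rate decay, and gradient clipping.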
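
The two factorized patterns in the list above can be sketched as boolean attention masks. This is a minimal pure-Python illustration, assuming the usual reading of the patterns: the strided pattern combines a local window with every `stride`-th earlier position, and the fixed pattern combines the current block with the last `c` "summary" columns of every block. The function names, the parameter `c`, and the choice to merge both heads into one mask are assumptions made for readability; the paper's actual implementation uses custom GPU kernels.

```python
def strided_mask(n, stride):
    """Causal mask for the strided pattern, with both heads merged.

    mask[i][j] is True when query position i may attend to key position j.
    """
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):                    # causal: keys at or before i
            local = (i - j) < stride              # most recent `stride` positions
            every_stride = (i - j) % stride == 0  # every stride-th earlier position
            mask[i][j] = local or every_stride
    return mask


def fixed_mask(n, stride, c=1):
    """Causal mask for the fixed pattern, with both heads merged.

    `c` (assumed name) is how many trailing columns of each block act as summaries.
    """
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            same_block = (j // stride) == (i // stride)  # keys in the query's own block
            summary = (j % stride) >= stride - c         # last c columns of any block
            mask[i][j] = same_block or summary
    return mask


def attended_per_query(mask):
    """Number of keys each query attends to (True counts as 1)."""
    return [sum(row) for row in mask]


if __name__ == "__main__":
    n, stride = 16, 4  # small illustrative sizes
    for name, mask in (("strided", strided_mask(n, stride)),
                       ("fixed", fixed_mask(n, stride))):
        print(name)
        for row in mask:
            print("".join("#" if allowed else "." for allowed in row))
```

Printing the masks for a small `n` shows why the strided pattern suits data with a natural period such as image rows, while the fixed pattern routes information through designated summary positions, which is the variant the results table reports for text.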

Experimental Results

Research Questions

  • RQ1: Can sparse, factorized attention match full attention on long sequences across text, images, and audio?
  • RQ2: Which sparse pattern (strided or fixed) yields the best performance for each data modality?
  • RQ3: How deep can Sparse Transformers be trained, and which memory and training techniques enable it?
  • RQ4: What is the impact of these patterns on density modeling benchmarks and sample quality?

Key Findings

| Model | Dataset / Task | Bits per byte (bits per dim for images) |
| --- | --- | --- |
| PixelCNN | CIFAR-10 | 3.03 |
| PixelCNN++ | CIFAR-10 | 2.92 |
| Image Transformer | CIFAR-10 | 2.90 |
| PixelSNAIL | CIFAR-10 | 2.85 |
| Sparse Transformer 59M (strided) | CIFAR-10 | 2.80 |
| Deeper Self-Attention (Al-Rfou et al., 2018) | Enwik8 | 1.06 |
| Transformer-XL 88M (Dai et al., 2018) | Enwik8 | 1.03 |
| Transformer-XL 277M (Dai et al., 2018) | Enwik8 | 0.99 |
| Sparse Transformer 95M (fixed) | Enwik8 | 0.99 |
| PixelCNN (Oord et al., 2016) | ImageNet 64x64 | 3.57 |
| Parallel Multiscale (Reed et al., 2017) | ImageNet 64x64 | 3.70 |
| Glow (Kingma & Dhariwal, 2018) | ImageNet 64x64 | 3.81 |
| SPN (Menick & Kalchbrenner, 2018) | ImageNet 64x64 | 3.52 |
| Sparse Transformer 152M (strided) | ImageNet 64x64 | 3.44 |
| Sparse Transformer 152M (strided) | Classical music (audio) | 1.97 |

  • Sparse Transformers achieve comparable or better density modeling performance than dense attention across CIFAR-10, Enwik8, ImageNet-64, and music data.
  • Strided and fixed sparse patterns provide substantial speedups over dense attention and, in some cases, better compression (lower bits per byte).
  • Models with hundreds of layers can be trained by architectural changes and gradient recomputation, enabling long-context modeling.
  • On CIFAR-10, strided sparse attention reaches 2.80–2.82 bits per dim, beating the prior state of the art (PixelSNAIL, 2.85).
  • On Enwik8, the 95M-parameter Sparse Transformer with fixed attention reaches 0.99 bits per byte, matching the 277M-parameter Transformer-XL and improving on the 88M-parameter Transformer-XL (1.03).
  • On ImageNet-64, the strided Sparse Transformer achieves 3.44 bits per dim, ahead of the best prior result listed (SPN, 3.52).
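
For readers unfamiliar with the metric, bits per dim (or bits per byte for text) is the model's negative log-likelihood expressed in base 2 and averaged over the elements of a sample. The helper below is a small sketch of that conversion; the function name is hypothetical, and the CIFAR-10 dimensionality of 32 × 32 × 3 = 3072 is the only dataset-specific number used.

```python
import math


def bits_per_dim(total_nll_nats: float, num_dims: int) -> float:
    """Convert a summed negative log-likelihood in nats into bits per dimension."""
    return total_nll_nats / (num_dims * math.log(2))


cifar_dims = 32 * 32 * 3  # pixels times colour channels in one CIFAR-10 image

# A model that guesses uniformly over 256 byte values scores exactly 8 bits per dim.
uniform_nll = cifar_dims * math.log(256)
print(f"uniform baseline: {bits_per_dim(uniform_nll, cifar_dims):.2f} bits per dim")

# Going the other way: the per-image NLL (in nats) implied by 2.80 bits per dim.
implied_nll = 2.80 * cifar_dims * math.log(2)
print(f"2.80 bits per dim corresponds to about {implied_nll:.0f} nats per image")
```

Because every dimension of an 8-bit image or byte-level text is one byte, the two names describe the same quantity, which is why the table can list both kinds of dataset in one column.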


This review was generated by AI and checked by a human editor.