QUICK REVIEW

[Paper Review] Efficiently Modeling Long Sequences with Structured State Spaces

Albert Gu, Karan Goel|arXiv (Cornell University)|Oct 31, 2021

Topic Modeling | References: 41 | Citations: 481

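One-Sentence Summary

S4 introduces a structured state space sequence model that handles very long sequences efficiently and achieves state-of-the-art results on long-range dependency benchmarks, including solving Path-X, while generating substantially faster than Transformers.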

ABSTRACT

A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of $10000$ or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) $ x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t) $, and showed that for appropriate choices of the state matrix $ A $, this system could handle long-range dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4) based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning $ A $ with a low-rank correction, allowing it to be diagonalized stably and reducing the SSM to the well-studied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91\% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet, (ii) substantially closing the gap to Transformers on image and language modeling tasks, while performing generation $60\times$ faster, (iii) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors.
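The abstract's "low-rank correction" can be made concrete with a small NumPy sketch. It assumes the state matrix is the HiPPO-LegS matrix and that the rank-1 vector has entries sqrt(n + 1/2); both formulas are written here as assumptions and are worth checking against the paper. The names `hippo_legs`, `nplr_correction`, and `is_normal` are illustrative and do not come from any library. The sketch only checks that A alone is not normal, while A + P Pᵀ is.

```python
import numpy as np

def hippo_legs(N):
    """HiPPO-LegS state matrix (assumed form): lower triangular, N x N."""
    n = np.arange(N)
    r = np.sqrt(2.0 * n + 1.0)
    # Strictly below the diagonal: -sqrt(2n+1) * sqrt(2k+1); on the diagonal: -(n+1).
    return -np.tril(np.outer(r, r), k=-1) - np.diag(n + 1.0)

def nplr_correction(N):
    """Rank-1 vector P such that hippo_legs(N) + P P^T is normal (assumed form)."""
    return np.sqrt(np.arange(N) + 0.5)

def is_normal(M, tol=1e-9):
    """A real matrix is normal when it commutes with its transpose."""
    return np.allclose(M @ M.T, M.T @ M, atol=tol)

N = 8
A = hippo_legs(N)
P = nplr_correction(N)
S = A + np.outer(P, P)  # equals -1/2 * I plus a skew-symmetric matrix

print(is_normal(A))                # A alone is not normal
print(is_normal(S))                # the corrected matrix is normal
print(np.linalg.eigvals(S).real)   # every real part is -1/2 up to rounding
```

Because the corrected matrix is normal, it can be diagonalized by a unitary change of basis, which is what makes the diagonalization numerically stable.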

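Research Motivation and Goals

Motivate the need for models that handle long-range dependencies across modalities and tasks.
Propose a practical, efficient sequence model based on state space models (SSMs) that scales to very long sequences.
Show that the normal-plus-low-rank (NPLR) parameterization makes computation fast and training stable.
Demonstrate S4's performance on image, text, and speech benchmarks and its competitiveness with Transformers.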

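Proposed Method

Reparameterize the state matrix A in normal-plus-low-rank (NPLR) form so that it can be diagonalized stably.
Compute the discrete SSM kernel efficiently by conjugating to diagonal form and applying the Woodbury identity to the low-rank correction.
Express the SSM convolution kernel as a Cauchy kernel, evaluate it through a truncated generating function sampled at the roots of unity, and then apply an inverse FFT.
Use the HiPPO theory of continuous-time memorization to handle long-range dependencies.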
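Provide an architecture that handles multi-feature inputs with H independent copies of the SSM, one per feature, applied in the manner of a depthwise convolution.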
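Two of the ingredients listed above can be sketched in NumPy and checked against dense linear algebra. This is a sketch under stated assumptions, not the paper's full algorithm: the function names (`cauchy`, `resolvent_dplr`, `kernel_naive`, `kernel_genfunc`) are hypothetical, the state matrix is taken to be already in diagonal-plus-low-rank form A = diag(lam) - p q*, and the generating-function part uses a dense solve per root of unity in place of the fast Cauchy-kernel evaluation.

```python
import numpy as np

def cauchy(u, v, z, lam):
    """Cauchy-kernel sum: sum_i conj(u_i) * v_i / (z - lam_i)."""
    return np.sum(np.conj(u) * v / (z - lam))

def resolvent_dplr(c, b, lam, p, q, z):
    """c^* (zI - A)^{-1} b for A = diag(lam) - p q^*, via the Woodbury identity.

    Only four Cauchy-kernel sums are needed; no N x N inverse is formed.
    """
    k_cb, k_cp = cauchy(c, b, z, lam), cauchy(c, p, z, lam)
    k_qb, k_qp = cauchy(q, b, z, lam), cauchy(q, p, z, lam)
    return k_cb - k_cp * k_qb / (1.0 + k_qp)

def kernel_naive(Ab, Bb, C, L):
    """Reference kernel K_k = C Ab^k Bb for k = 0..L-1 by repeated multiplication."""
    K = np.empty(L)
    x = Bb.copy()
    for k in range(L):
        K[k] = C @ x
        x = Ab @ x
    return K

def kernel_genfunc(Ab, Bb, C, L):
    """Same kernel from its truncated generating function.

    sum_{k<L} K_k z^k = C (I - Ab^L z^L) (I - z Ab)^{-1} Bb, and z^L = 1 at the
    L-th roots of unity, so the samples equal the DFT of K.
    """
    N = Ab.shape[0]
    Ct = C @ (np.eye(N) - np.linalg.matrix_power(Ab, L))  # truncation term
    roots = np.exp(-2j * np.pi * np.arange(L) / L)
    Khat = np.array([Ct @ np.linalg.solve(np.eye(N) - z * Ab, Bb) for z in roots])
    return np.fft.ifft(Khat).real

rng = np.random.default_rng(0)
N, L = 6, 32

# Woodbury check; q = p and Re(z) > Re(lam) keep the scalar denominator away from 0.
lam = -0.5 + 1j * rng.standard_normal(N)
p, b, c = (rng.standard_normal(N) + 1j * rng.standard_normal(N) for _ in range(3))
z = 0.7 + 0.2j
A = np.diag(lam) - np.outer(p, np.conj(p))
dense = np.vdot(c, np.linalg.solve(z * np.eye(N) - A, b))
print(abs(resolvent_dplr(c, b, lam, p, p, z) - dense))

# Generating-function check on a stable discrete matrix (spectral norm 0.9).
M = rng.standard_normal((N, N))
Ab = 0.9 * M / np.linalg.norm(M, 2)
Bb, C = rng.standard_normal(N), rng.standard_normal(N)
print(np.max(np.abs(kernel_genfunc(Ab, Bb, C, L) - kernel_naive(Ab, Bb, C, L))))
```

Both printed differences should be at the level of floating-point rounding. The sketch keeps the two steps separate for clarity; combining them so that each generating-function sample is itself a Cauchy-kernel evaluation is where the efficiency described above comes from.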

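Experimental Results

Research Questions

RQ1: Can an SSM with the S4 parameterization model very long sequences (length 16k and beyond) efficiently while matching or exceeding Transformer performance on standard benchmarks?
RQ2: In language and image modeling, how closely can attention-free or low-attention models approach Transformers while offering faster generation?
RQ3: Can SSM-based models generalize across domains (image, text, speech) with minimal architectural changes?
RQ4: What theoretical and computational guarantees (complexity, stability) does the NPLR S4 parameterization provide for the recurrent and convolutional representations?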
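The recurrent and convolutional representations mentioned in RQ4 compute the same outputs, which a short NumPy sketch can confirm on a toy system. The assumptions here are mine: a bilinear discretization with step `dt`, a small random stable state matrix, D = 0, and illustrative function names (`discretize_bilinear`, `run_recurrent`, `ssm_kernel`, `run_convolutional`). The kernel is built naively, so this shows the equivalence only, not S4's fast kernel computation.

```python
import numpy as np

def discretize_bilinear(A, B, dt):
    """Bilinear (Tustin) discretization of x' = Ax + Bu with step dt."""
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - 0.5 * dt * A)
    return inv @ (I + 0.5 * dt * A), dt * (inv @ B)

def run_recurrent(Ab, Bb, C, u):
    """Step the state one input at a time: x_k = Ab x_{k-1} + Bb u_k, y_k = C x_k."""
    x = np.zeros(Ab.shape[0])
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        x = Ab @ x + Bb * u_k
        y[k] = C @ x
    return y

def ssm_kernel(Ab, Bb, C, L):
    """Convolution kernel K_k = C Ab^k Bb for k = 0..L-1 (naive construction)."""
    K = np.empty(L)
    x = Bb.copy()
    for k in range(L):
        K[k] = C @ x
        x = Ab @ x
    return K

def run_convolutional(Ab, Bb, C, u):
    """Causal convolution y = K * u computed with zero-padded FFTs."""
    L = len(u)
    K = ssm_kernel(Ab, Bb, C, L)
    spectrum = np.fft.rfft(K, 2 * L) * np.fft.rfft(u, 2 * L)
    return np.fft.irfft(spectrum, 2 * L)[:L]

rng = np.random.default_rng(0)
N, L = 4, 64
M = rng.standard_normal((N, N))
A = M - M.T - np.eye(N)          # skew-symmetric minus I: every eigenvalue has real part -1
B, C = rng.standard_normal(N), rng.standard_normal(N)
u = rng.standard_normal(L)

Ab, Bb = discretize_bilinear(A, B, dt=0.1)
y_rec = run_recurrent(Ab, Bb, C, u)
y_conv = run_convolutional(Ab, Bb, C, u)
print(np.max(np.abs(y_rec - y_conv)))  # difference at rounding level
```

The recurrent form touches one input per step, which suits autoregressive generation; the convolutional form processes the whole sequence at once, which suits parallel training.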

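Key Findings

S4 reaches 91% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet.
S4 substantially closes the gap to Transformers on image and language modeling tasks while generating about 60 times faster.
S4 sets the state of the art on the Long Range Arena tasks, including solving Path-X (length 16k) with 88% accuracy, where prior work did no better than random guessing.
On a speech classification task with length-16000 inputs, S4 lowers test error to 1.7%, outperforming a specialized speech CNN and the other baselines.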
On WikiText-103 language modeling, S4 comes within 0.8 perplexity of the Transformer baseline, showing that an attention-free model can be competitive.
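S4 shows fast autoregressive generation, applicability across domains (image, text, speech), and robustness to changes in sampling rate without retraining.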

This review was generated by AI and reviewed by a human editor.