QUICK REVIEW

[Paper Review] Were RNNs All We Needed?

Leo Feng, Frederick Tung|arXiv (Cornell University)|Oct 2, 2024

Cited by: 5

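ONE-SENTENCE SUMMARY

This paper revisits LSTMs and GRUs, removes the gates' dependence on the previous hidden state so that training can be parallelized, introduces minLSTM and minGRU, and shows that they match or exceed recent sequence models on a range of tasks while training substantially faster.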

ABSTRACT

The introduction of Transformers in 2017 reshaped the landscape of deep learning. Originally proposed for sequence modelling, Transformers have since achieved widespread success across various domains. However, the scalability limitations of Transformers - particularly with respect to sequence length - have sparked renewed interest in novel recurrent models that are parallelizable during training, offer comparable performance, and scale more effectively. In this work, we revisit sequence modelling from a historical perspective, focusing on Recurrent Neural Networks (RNNs), which dominated the field for two decades before the rise of Transformers. Specifically, we examine LSTMs (1997) and GRUs (2014). We demonstrate that by simplifying these models, we can derive minimal versions (minLSTMs and minGRUs) that (1) use fewer parameters than their traditional counterparts, (2) are fully parallelizable during training, and (3) achieve surprisingly competitive performance on a range of tasks, rivalling recent models including Transformers.

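MOTIVATION AND GOALS

Re-evaluate traditional RNNs (LSTMs/GRUs) in the context of parallelizable training.
Develop minimal, parameter-efficient variants that remove the hidden-state dependence.
Show that minLSTM and minGRU can be trained in parallel, with large speedups, while matching modern sequence models on a variety of tasks.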

PROPOSED METHOD

Rewrite the LSTM/GRU gates to remove their dependence on h_{t-1}, so that the recurrence fits the parallel prefix scan formulation.
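Remove the tanh-based range restriction on the candidate and hidden states; for minLSTM, additionally normalize the forget and input gates so that the scale of the hidden state does not depend on the time step.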
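Derive minGRU and minLSTM, which have substantially fewer parameters and are trained in parallel with the parallel prefix scan algorithm.
Empirically compare minGRU/minLSTM against GRU/LSTM and against recent models such as Mamba on several tasks (synthetic data, reinforcement learning, language modelling).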
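
The steps above can be illustrated with a small sketch. Once the gates depend only on x_t, each step is an affine map h_t = a_t * h_{t-1} + b_t, and composing affine maps is associative, which is what a prefix scan needs. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it uses a scalar input and a single hidden unit, the weight names (w_z, b_z, and so on) are hypothetical, the minLSTM gate normalization f / (f + i) is one reading of the "normalize" step that should be checked against the paper, and the scan is a generic Hillis-Steele formulation run in an ordinary Python loop rather than on parallel hardware.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def min_gru_coeffs(xs, w_z, b_z, w_h, b_h):
    """Per-step (a_t, b_t) for h_t = a_t * h_{t-1} + b_t (minGRU-style).

    Assumption: scalar input, one hidden unit, hypothetical weight names.
    The gate and the candidate depend on x_t only, never on h_{t-1}.
    """
    coeffs = []
    for x in xs:
        z = sigmoid(w_z * x + b_z)   # update gate
        h_tilde = w_h * x + b_h      # candidate state, no tanh
        coeffs.append((1.0 - z, z * h_tilde))
    return coeffs


def min_lstm_coeffs(xs, w_f, b_f, w_i, b_i, w_h, b_h):
    """Same idea for a minLSTM-style cell with normalized gates.

    Assumption: forget/input gates are normalized to sum to 1.
    """
    coeffs = []
    for x in xs:
        f = sigmoid(w_f * x + b_f)
        i = sigmoid(w_i * x + b_i)
        f_n, i_n = f / (f + i), i / (f + i)
        h_tilde = w_h * x + b_h
        coeffs.append((f_n, i_n * h_tilde))
    return coeffs


def sequential(coeffs, h0):
    """Reference: unroll the recurrence one step at a time."""
    hs, h = [], h0
    for a, b in coeffs:
        h = a * h + b
        hs.append(h)
    return hs


def combine(left, right):
    """Compose two affine maps: apply `left` first, then `right`."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)


def prefix_scan(coeffs):
    """Inclusive scan in O(log T) rounds (Hillis-Steele style).

    Updates within a round are independent of each other, so a parallel
    implementation could run them at once; here they run in a loop.
    """
    out = list(coeffs)
    step = 1
    while step < len(out):
        prev = out
        out = [
            prev[i] if i < step else combine(prev[i - step], prev[i])
            for i in range(len(prev))
        ]
        step *= 2
    return out


def scan_hidden_states(coeffs, h0):
    """Apply each prefix-composed affine map to the initial state."""
    return [a * h0 + b for a, b in prefix_scan(coeffs)]
```

Calling sequential(coeffs, h0) and scan_hidden_states(coeffs, h0) on the same coefficients should give the same hidden states up to floating-point error; the scan simply reaches them in logarithmically many rounds instead of one step per time index.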

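EXPERIMENTAL RESULTS

Research Questions

RQ1: Can the classic LSTM/GRU architectures be reformulated so that they train in parallel, without backpropagation through time?
RQ2: Can the minimal variants (minGRU/minLSTM) achieve competitive performance against Transformers and state-of-the-art recurrent models while using substantially fewer parameters and training in parallel?
RQ3: What are the trade-offs in speed, memory, and stability of removing the hidden-state dependence and the output range constraint?
RQ4: Do minimal RNNs scale to the tasks used to benchmark modern sequence models (Selective Copying, D4RL RL tasks, language modelling)?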

Key Findings

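minGRU and minLSTM train in parallel via the parallel scan algorithm. At sequence length 512, the experiments show a speedup of roughly 175× for minGRU over GRU and roughly 235× for minLSTM over LSTM.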
The minimal models use substantially fewer parameters (e.g. minGRU ~ O(2 d_h d_x) vs. GRU ~ O(3 d_h (d_x + d_h)); minLSTM ~ O(3 d_h d_x) vs. LSTM ~ O(4 d_h (d_x + d_h))).
In the training runtime comparison, minGRU and minLSTM are on par with Mamba and dramatically faster than traditional RNNs. At sequence length 512, the reported runtimes are 2.97 ms (minLSTM), 2.72 ms (minGRU), and 2.71 ms (Mamba).
For longer sequences (length 4096), the speedup is even larger: minGRU and minLSTM are 1324× and 1361× faster than GRU and LSTM, respectively.
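On the Selective Copying and D4RL RL benchmarks, minGRU and minLSTM perform on par with or better than the S4, Hyena, and Transformer baselines, and they exceed Decision S4 in average performance across several datasets.
On Shakespeare language modelling, minGRU and minLSTM reach test losses close to those of Mamba and the Transformer; the Transformer needs substantially more training steps (roughly 2.5× as many) to reach comparable performance.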
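
The parameter-count comparison in the findings above can be checked with a few lines of arithmetic. This sketch evaluates only the leading-order expressions quoted there and ignores biases and any other layers; the function names are hypothetical and chosen for illustration.

```python
def gru_params(d_x, d_h):
    """Leading-order GRU count: 3 * d_h * (d_x + d_h)."""
    return 3 * d_h * (d_x + d_h)


def min_gru_params(d_x, d_h):
    """Leading-order minGRU count: 2 * d_h * d_x."""
    return 2 * d_h * d_x


def lstm_params(d_x, d_h):
    """Leading-order LSTM count: 4 * d_h * (d_x + d_h)."""
    return 4 * d_h * (d_x + d_h)


def min_lstm_params(d_x, d_h):
    """Leading-order minLSTM count: 3 * d_h * d_x."""
    return 3 * d_h * d_x


if __name__ == "__main__":
    d_x = d_h = 64  # illustrative sizes, not taken from the paper
    print(min_gru_params(d_x, d_h) / gru_params(d_x, d_h))
    print(min_lstm_params(d_x, d_h) / lstm_params(d_x, d_h))
```

With d_x = d_h, these expressions put minGRU at one third of the GRU count and minLSTM at three eighths of the LSTM count; the gap widens as d_h grows relative to d_x, because the minimal variants have no d_h × d_h term.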

This review was generated by AI and checked by human editors.