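[Paper Explained] Dissecting Linear Recurrent Models: How Different Gating Strategies Drive Selectivity and Generalization
This paper introduces SelectivBench, a lightweight synthetic benchmark for analyzing linear recurrent models (LRMs). It shows how gating, rapid forgetting, and channel mixing affect selectivity and generalization, and it compares several LRM variants against a Transformer.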
Linear recurrent neural networks have emerged as efficient alternatives to the original Transformer's softmax attention mechanism, thanks to their highly parallelizable training and constant memory and computation requirements at inference. Iterative refinements of these models have introduced an increasing number of architectural mechanisms, leading to increased complexity and computational costs. Nevertheless, systematic direct comparisons among these models remain limited. Existing benchmark tasks are either too simplistic to reveal substantial differences or excessively resource-intensive for experimentation. In this work, we propose a refined taxonomy of linear recurrent models and introduce SelectivBench, a set of lightweight and customizable synthetic benchmark tasks for systematically evaluating sequence models. SelectivBench specifically evaluates selectivity in sequence models at small to medium scale, such as the capacity to focus on relevant inputs while ignoring context-based distractors. It employs rule-based grammars to generate sequences with adjustable complexity, incorporating irregular gaps that intentionally violate transition rules. Evaluations of linear recurrent models on SelectivBench reveal performance patterns consistent with results from large-scale language tasks. Our analysis clarifies the roles of essential architectural features: gating and rapid forgetting mechanisms facilitate recall, in-state channel mixing is unnecessary for selectivity, but critical for generalization, and softmax attention remains dominant due to its memory capacity scaling with sequence length. Our benchmark enables targeted, efficient exploration of linear recurrent models and provides a controlled setting for studying behaviors observed in large-scale evaluations. Code is available at https://github.com/symseqbench/selectivbench
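The abstract describes sequences generated from rule-based grammars, with irregular gaps that break the transition rules. The following is a minimal sketch of that idea, not the benchmark's actual generator: the toy grammar `TRANSITIONS`, the distractor token `GAP`, and the function names are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy grammar: each symbol maps to the symbols allowed to follow it.
TRANSITIONS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A", "D"],
    "D": ["A"],
}
GAP = "_"  # assumed distractor token that is not part of the grammar


def generate_sequence(length, gap_prob, max_gap, rng):
    """Walk the grammar for `length` symbols; before each transition,
    insert a run of 1..max_gap gap tokens with probability gap_prob."""
    symbol = rng.choice(sorted(TRANSITIONS))
    tokens = [symbol]
    is_gap = [False]
    for _ in range(length - 1):
        if max_gap > 0 and rng.random() < gap_prob:
            run = rng.randint(1, max_gap)
            tokens.extend([GAP] * run)
            is_gap.extend([True] * run)
        symbol = rng.choice(TRANSITIONS[symbol])
        tokens.append(symbol)
        is_gap.append(False)
    return tokens, is_gap


def strip_gaps(tokens, is_gap):
    """Return only the grammatical symbols, dropping every gap token."""
    return [t for t, g in zip(tokens, is_gap) if not g]


def is_grammatical(symbols):
    """Check that every adjacent pair follows an allowed transition."""
    return all(b in TRANSITIONS[a] for a, b in zip(symbols, symbols[1:]))


if __name__ == "__main__":
    tokens, is_gap = generate_sequence(12, gap_prob=0.3, max_gap=3, rng=random.Random(0))
    print(" ".join(tokens))
```

In this sketch, a model that is selective in the sense used above would need to predict the next grammatical symbol while ignoring the gap runs; raising `gap_prob` or `max_gap` is one assumed way to make the task harder.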
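Motivation and Objectives
- Define a refined taxonomy of linear recurrent models (LRMs).
- Introduce SelectivBench to systematically evaluate selectivity and generalization on synthetic tasks.
- Assess how architectural features (gating, forgetting, channel mixing) affect performance across tasks.
- Compare LRMs against a Transformer baseline on controlled synthetic grammars to understand memory and distractor handling.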
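Proposed Method
- Formalize LRMs as element-wise state updates with data-dependent gates (A_t, B_t, C_t).
- Introduce complementary gating and input/output gating to study coordinated write/read behavior.
- Build the SelectivBench tasks on top of SymSeqBench, inserting gaps/noise and non-grammatical tokens to probe selectivity.
- Use artificial-grammar sequences with controllable topological entropy (TE) to adjust task difficulty.
- Evaluate models on tasks that measure memory, noise rejection, context-aware selectivity, and length generalization.
- Provide a cross-model comparison covering DeltaNet, GLA, Mamba, Mamba2, gated Delta variants, and a Transformer.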
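The element-wise formulation with gates A_t, B_t, C_t can be illustrated with a small sketch. It assumes the common form h_t = A_t * h_{t-1} + B_t * x_t with readout y_t = C_t * h_t, and it assumes that complementary gating means tying the write gate to the forget gate as B_t = 1 - A_t; the paper's exact parameterization may differ. The per-channel gate weights `w_a`, `w_b`, `w_c` and the function name are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_linear_recurrence(xs, w_a, w_c, w_b=None):
    """Run an element-wise gated recurrence over a sequence.

    xs: list of T input vectors, each a list of d floats.
    w_a, w_b, w_c: per-channel gate weights of length d (hypothetical
    parameterization; each gate depends only on the current input).
    If w_b is None, the write gate is tied to the forget gate (B_t = 1 - A_t).
    Returns the list of outputs y_t and the final state h_T.
    """
    d = len(w_a)
    h = [0.0] * d
    ys = []
    for x in xs:
        a = [sigmoid(w_a[i] * x[i]) for i in range(d)]  # forget gate A_t
        if w_b is None:
            b = [1.0 - a[i] for i in range(d)]  # complementary write gate
        else:
            b = [sigmoid(w_b[i] * x[i]) for i in range(d)]  # independent write gate B_t
        c = [sigmoid(w_c[i] * x[i]) for i in range(d)]  # read gate C_t
        h = [a[i] * h[i] + b[i] * x[i] for i in range(d)]  # element-wise state update
        ys.append([c[i] * h[i] for i in range(d)])
    return ys, h


if __name__ == "__main__":
    xs = [[1.0, -1.0], [0.5, 0.0], [0.0, 2.0]]
    ys, h = gated_linear_recurrence(xs, w_a=[0.5, -0.5], w_c=[1.0, 1.0])
    print(h)
```

With the complementary gate, each state channel is a convex combination of its previous value and the current input, so writing more strongly necessarily means forgetting more. Every operation here acts on one channel at a time, so this sketch contains no in-state channel mixing.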
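Experimental Results
Research Questions
- RQ1: How do different LRM gating strategies, including complementary gating, affect selectivity and memory recall?
- RQ2: What role does channel mixing play in how LRMs handle distractors and generalize across sequence lengths?
- RQ3: Do the synthetic SelectivBench tasks reveal the trends seen in large-scale language tasks, with consistent relative performance across LRMs and the Transformer?
- RQ4: How do LRMs compare with the Transformer on memory-intensive tasks with noise and non-grammatical gaps?
- RQ5: How much do gating and rapid forgetting mechanisms contribute to gap robustness and length generalization?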
Key Findings
| Model | # Gate Params | # Params (M) | Task 1 Accuracy | Task 2 Accuracy | Task 3 Accuracy |
|---|---|---|---|---|---|
| DeltaNet | d×N_heads + d^2 | 83 | 0.50 ± 0.01 | 0.41 ± 0.01 | 0.35 ± 0.001 |
| GLA | ~ d×16 | 87 | 0.75 ± 0.01 | 0.68 ± 0.05 | 0.50 ± 0.01 |
| Mamba | 2d×d_state + 8d^2/16 | 90 (71 in task 2) | 0.89 ± 0.009 | 0.71 ± 0.01 | 0.63 ± 0.08 |
| Mamba2 | ~ 2d×N_heads | 85 (67 in task 2) | 0.92 ± 0.007 | 0.67 ± 0.01 | 0.64 ± 0.01 |
| Gated DeltaNet | 2d_gate^2? | 87 | 0.87 ± 0.01 | 0.76 ± 0.01 | 0.57 ± 0.008 |
| Gated DeltaProduct | 2d_gate^2? | 82 | 0.87 ± 0.008 | 0.74 ± 0.02 | 0.61 ± 0.01 |
| Transformer | - | 78 | 0.86 ± 0.005 | 0.77 ± 0.01 | 0.67 ± 0.07 |
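- Gating and rapid forgetting mechanisms help LRM recall across tasks.
- In-state channel mixing is not necessary for selectivity but is critical for generalization.
- Softmax attention remains dominant on tasks that require long memory and extended context.
- LRMs with complementary gating (e.g., the Mamba/Mamba2 variants) outperform other methods on selectivity tasks with distractors.
- Gated DeltaNet and Gated DeltaProduct generalize strongly to longer gaps, suggesting that channel mixing aids memory robustness.
- The Transformer performs well on some tasks but shows a generalization gap when extrapolating to longer gaps, revealing generalization challenges at very long contexts.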
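The point about softmax attention and memory can be made concrete with a back-of-the-envelope count. The sketch below assumes a standard key/value cache (one key and one value vector per token per layer) and a fixed-size recurrent state per layer. The function names and the example sizes are hypothetical and are not taken from the paper.

```python
def kv_cache_floats(seq_len, n_layers, d_model):
    """Floats cached by softmax attention: one key and one value vector
    per token per layer, so the count grows linearly with seq_len."""
    return 2 * seq_len * n_layers * d_model


def recurrent_state_floats(n_layers, d_model, d_state):
    """Floats held by a linear recurrent model: a fixed-size state per
    layer, independent of how many tokens have been processed."""
    return n_layers * d_model * d_state


if __name__ == "__main__":
    for seq_len in (256, 1024, 4096):
        attn = kv_cache_floats(seq_len, n_layers=2, d_model=64)
        lrm = recurrent_state_floats(n_layers=2, d_model=64, d_state=16)
        print(seq_len, attn, lrm)
```

Under these assumptions the attention cache keeps growing with the sequence while the recurrent state stays constant, which is one way to read why gating and forgetting matter for LRMs: they decide what occupies a fixed memory budget.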
This explainer was generated by AI and reviewed by a human editor.