QUICK REVIEW

[Paper Review] PRISM: Parallel Residual Iterative Sequence Model

Jie Jiang, Ke Cheng|arXiv (Cornell University)|Feb 11, 2026
Parallel Computing and Optimization Techniques | Cited by 0
One-Sentence Summary

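PRISM introduces a parallelizable, amortized residual-optimization framework that mimics multi-step iterative refinement. Combined with linear attention, it achieves high expressivity while delivering up to 174x higher throughput than explicit optimization methods.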

ABSTRACT

Generative sequence modeling faces a fundamental tension between the expressivity of Transformers and the efficiency of linear sequence models. Existing efficient architectures are theoretically bounded by shallow, single-step linear updates, while powerful iterative methods like Test-Time Training (TTT) break hardware parallelism due to state-dependent gradients. We propose PRISM (Parallel Residual Iterative Sequence Model) to resolve this tension. PRISM introduces a solver-inspired inductive bias that captures key structural properties of multi-step refinement in a parallelizable form. We employ a Write-Forget Decoupling strategy that isolates non-linearity within the injection operator. To bypass the serial dependency of explicit solvers, PRISM utilizes a two-stage proxy architecture: a short-convolution anchors the initial residual using local history energy, while a learned predictor estimates the refinement updates directly from the input. This design distills structural patterns associated with iterative correction into a parallelizable feedforward operator. Theoretically, we prove that this formulation achieves Rank-$L$ accumulation, structurally expanding the update manifold beyond the single-step Rank-$1$ bottleneck. Empirically, it achieves comparable performance to explicit optimization methods while achieving 174x higher throughput.
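
The Rank-$L$ claim in the abstract can be illustrated with a small numerical check. The sketch below is not the paper's implementation. It only shows the linear-algebra fact behind the claim: a single outer product has rank 1, while a sum of $L$ outer products along orthogonal key directions with linearly independent value directions has rank $L$. The names `outer`, `mat_add`, `matrix_rank`, `keys`, and `values`, and the sizes `d = 5`, `L = 3`, are assumptions chosen for illustration.

```python
def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[ui * vj for vj in v] for ui in u]

def mat_add(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def matrix_rank(M, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting."""
    A = [row[:] for row in M]
    rows, cols = len(A), len(A[0])
    rank = 0
    for c in range(cols):
        if rank == rows:
            break
        pivot = max(range(rank, rows), key=lambda r: abs(A[r][c]))
        if abs(A[pivot][c]) < tol:
            continue  # no usable pivot in this column
        A[rank], A[pivot] = A[pivot], A[rank]
        for r in range(rank + 1, rows):
            f = A[r][c] / A[rank][c]
            for j in range(c, cols):
                A[r][j] -= f * A[rank][j]
        rank += 1
    return rank

d, L = 5, 3  # hypothetical state width and refinement depth
# Orthogonal key directions: the first L standard basis vectors.
keys = [[1.0 if i == l else 0.0 for i in range(d)] for l in range(L)]
# Linearly independent value directions (Vandermonde columns), fixed for determinism.
values = [[float((i + 1) ** l) for i in range(d)] for l in range(L)]

terms = [outer(v, k) for v, k in zip(values, keys)]  # L rank-1 updates
B = terms[0]
for T in terms[1:]:
    B = mat_add(B, T)  # accumulated injection

print([matrix_rank(T) for T in terms], matrix_rank(B))  # prints [1, 1, 1] 3
```

Each individual term stays at rank 1, which corresponds to the single-step bottleneck the abstract describes. Only the accumulated sum reaches rank 3.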

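Research Motivation and Goals

  • Address the expressivity bottleneck of linear attention on long sequences.
  • Bridge the gap between efficient linear models and expressive optimization-based methods.
  • Develop a hardware-aware architecture that permits parallel multi-step refinement.
  • Characterize theoretically the principles of Rank Accumulation and Write-Forget Decoupling.
  • Validate PRISM empirically against strong baselines on long-sequence recommendation benchmarks.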

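Proposed Method

  • Propose Write-Forget Decoupling, which keeps the forgetting dynamics low-rank and pushes the high-rank non-linear refinement into the injection term.
  • Introduce an input-anchored, loop-unrolled architecture with a two-stage proxy: a short-convolution anchor estimates $S_{t-1}k_t$, and a learned predictor generates the multi-step refinement.
  • Construct a high-rank injection $B_t$ as the sum of $L$ orthogonal rank-1 components, using iterative refinement with gated residual updates.
  • Maintain a state-independent forgetting operator $A_t$, so that parallel-scan efficiency is preserved when the accumulated $B_t$ is injected into the recurrent state.
  • Establish rank accumulation theoretically, and analyze the stability of the forgetting and injection components under spectral perturbations.
  • Compare PRISM experimentally with Transformers, linear baselines, and optimization solvers, reporting accuracy and training throughput.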
Figure 1 : The PRISM Architecture. The framework operates in two phases to approximate the Ideal Non-Linear Solver within a parallelizable linear recurrence. Phase 1 (Input-Anchored Simulation): A ShortConv anchor captures the local pre-activation proxy ( $u_{t}\approx S_{t-1}k_{t}$ ). Parallel pred
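
To see why a state-independent forgetting operator matters for parallelism, consider a recurrence of the assumed form $S_t = A_t S_{t-1} + B_t$, where both $A_t$ and $B_t$ are computed from the input alone. Each step is then an affine map that is known up front, and composing affine maps is associative, so a prefix scan can replace the sequential loop. The sketch below uses scalar `a` and `b` for brevity (the paper's state is matrix-valued) and simulates a Hillis-Steele scan in plain Python. The function names are hypothetical, and this is a generic scan, not PRISM's kernel.

```python
def combine(left, right):
    """Compose two affine maps s -> a*s + b; `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)

def sequential(a, b, s0=0.0):
    """Reference recurrence s_t = a_t * s_{t-1} + b_t, one step at a time."""
    s, out = s0, []
    for a_t, b_t in zip(a, b):
        s = a_t * s + b_t
        out.append(s)
    return out

def scan_doubling(a, b):
    """Hillis-Steele inclusive scan over (a_t, b_t) pairs.

    Runs in ceil(log2(T)) rounds. Within a round every combine is
    independent, so on parallel hardware a round costs one step.
    Assumes the initial state is zero.
    """
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(len(elems))
        ]
        step *= 2
    # With s0 = 0 the state equals the accumulated offset of each prefix map.
    return [b_acc for _, b_acc in elems]

a = [0.9, 0.5, 0.8, 1.0, 0.7]  # forgetting factors, depend on input only
b = [1.0, 2.0, 3.0, 4.0, 5.0]  # injections, depend on input only

seq = sequential(a, b)
par = scan_doubling(a, b)
print(all(abs(x - y) < 1e-9 for x, y in zip(seq, par)))  # prints True
```

If `a_t` or `b_t` depended on the previous state, as with the state-dependent gradients the abstract attributes to TTT, the pairs could not be formed before the scan starts and the loop would have to run step by step.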

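Experimental Results

Research Questions

  • RQ1: Can amortized, input-anchored refinement match explicit iterative solvers (such as TTT) while remaining parallel?
  • RQ2: On long sequences, can PRISM achieve higher throughput than explicit optimization methods without sacrificing modeling fidelity?
  • RQ3: Is iterative high-rank injection critical to performance, and do its components (anchor, gain predictor, iteration depth) each contribute significantly?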

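Key Findings

  • PRISM performs comparably to explicit iterative solvers and deep Transformers on challenging benchmarks.
  • PRISM achieves up to 174x higher training throughput than explicit optimization methods.
  • PRISM closes the gap to quadratic Transformers, indicating that amortized refinement is substantially expressive.
  • Ablation studies show that iteration depth, non-linearity, anchoring, and gating each contribute significantly to performance.
  • Mechanistic probing shows that, in constrained settings, PRISM can outperform linear baselines on non-linear tasks.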
Figure 2 : Training throughput comparison of 0.13B models on a single H20 GPU.

This review was generated by AI and reviewed by human editors.