QUICK REVIEW

[Paper Review] Scaling Transformer to 1M tokens and beyond with RMT

Aydar Bulatov, Yuri Kuratov|arXiv (Cornell University)|Apr 19, 2023

Topic Modeling | Cited by 19

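One-sentence summary

This paper proposes the Recurrent Memory Transformer (RMT), a memory-augmented, segment-level recurrent method that attaches trainable memory tokens and uses curriculum learning, allowing Transformer models to handle inputs of up to 2 million tokens with linearly scaling compute.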

ABSTRACT

A major limitation for the broader scope of problems solvable by transformers is the quadratic scaling of computational complexity with input size. In this study, we investigate the recurrent memory augmentation of pre-trained transformer models to extend input context length while linearly scaling compute. Our approach demonstrates the capability to store information in memory for sequences of up to an unprecedented two million tokens while maintaining high retrieval accuracy. Experiments with language modeling tasks show perplexity improvement as the number of processed input segments increases. These results underscore the effectiveness of our method, which has significant potential to enhance long-term dependency handling in natural language understanding and generation tasks, as well as enable large-scale context processing for memory-intensive applications.

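Motivation and goals

Demonstrate memory-augmented segment-level recurrence (RMT) as a plug-in wrapper for encoder-only and decoder-only Transformers.
Show that RMT can process extremely long sequences (up to 2M tokens) with linear compute and constant memory at inference time.
Develop and benchmark memory acquisition and retention tasks that scale to million-token contexts, to evaluate how well memory operations generalize.
Study the effect of RMT on long-range language modeling and formal reasoning tasks, to assess practical gains across domains.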

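Proposed method

Attach a token-based memory module to a pretrained Transformer without changing the original architecture.
Handle long inputs by splitting them into fixed-size segments and applying full attention only within each segment, which gives linear scaling.
Train the memory tokens recurrently across segments, so that the memory output of one segment influences the segments that follow.
Use curriculum learning to gradually extend task length from a single segment to multi-segment contexts.
Evaluate memory operations on synthetic memory tasks, and extend the experiments to long-range language modeling and theorem-proving-style generation.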
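
The segment-level recurrence described above can be illustrated with a minimal sketch. It is not the authors' implementation: `split_into_segments`, `run_recurrent_memory`, and `toy_backbone` are hypothetical names, and the toy backbone is a plain arithmetic stand-in for a Transformer forward pass over the memory tokens plus one segment. The only point it shows is the control flow: each segment is processed on its own, and the memory returned for one segment is the memory fed to the next.

```python
from typing import Callable, List, Sequence, Tuple

Memory = List[int]


def split_into_segments(tokens: Sequence[int], segment_len: int) -> List[List[int]]:
    """Cut a long token sequence into consecutive fixed-size segments."""
    return [list(tokens[i:i + segment_len]) for i in range(0, len(tokens), segment_len)]


def run_recurrent_memory(
    tokens: Sequence[int],
    segment_len: int,
    memory: Memory,
    backbone: Callable[[Memory, List[int]], Tuple[List[int], Memory]],
) -> Tuple[List[List[int]], Memory]:
    """Process segments in order, handing the updated memory to the next segment."""
    outputs = []
    for segment in split_into_segments(tokens, segment_len):
        # The backbone only ever sees one segment plus the memory,
        # so the work per step does not grow with the total input length.
        segment_out, memory = backbone(memory, segment)
        outputs.append(segment_out)
    return outputs, memory


def toy_backbone(memory: Memory, segment: List[int]) -> Tuple[List[int], Memory]:
    """Hypothetical stand-in for a Transformer pass over [memory; segment].

    Memory here is just [running_sum, running_count]; a real model would
    carry learned vectors instead.
    """
    total, count = memory
    # Earlier segments influence this output only through the memory.
    segment_out = [tok + total for tok in segment]
    new_memory = [total + sum(segment), count + len(segment)]
    return segment_out, new_memory


if __name__ == "__main__":
    outs, final_memory = run_recurrent_memory(list(range(10)), 4, [0, 0], toy_backbone)
    print(outs)
    print(final_memory)
```

In this toy, the final memory summarizes every token seen, even though no single call to the backbone received more than one segment. Training the real memory tokens would additionally require backpropagating through the loop, which the sketch leaves out.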

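Experimental results

Research questions

RQ1: Can RMT extend the effective context length of a pretrained Transformer to the multi-million-token range while keeping compute cost linear?
RQ2: How well can a memory-augmented Transformer memorize, retrieve, and reason over facts in extremely long sequences?
RQ3: When trained on segmented tasks of gradually increasing length, does the memory-augmented model generalize to longer sequences?
RQ4: How does RMT affect perplexity and prediction quality in long-text language modeling and formal proof generation?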

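Key findings

For a fixed segment size, RMT's compute scales linearly with input length; on multi-segment inputs it needs substantially fewer FLOPs than non-recurrent models (up to 295× fewer in some cases).
With memory, a pretrained BERT backbone can store and retrieve information across up to 2,000,000 tokens (4,096 segments of 512 tokens each).
Curriculum learning improves training stability and generalization, allowing models trained on shorter tasks to solve substantially longer ones.
In long-range language modeling, RMT with memory improves perplexity over the baseline, and carrying memory across segments makes predictions more stable at segment boundaries.
RMT exhibits memory operations that are visible in its attention patterns, and its memory retrieval generalizes to extremely long sequences, suggesting that, for suitable tasks, there is no inherent technical limit to scaling beyond 2M tokens.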
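
The token count and the linear-versus-quadratic claim can be checked with simple arithmetic. The sketch below is a simplified assumption, not the paper's FLOPs accounting: it counts only attended token pairs, ignores feed-forward layers and all constants, and the function names and the `memory_len` parameter are hypothetical. Its ratio is therefore not comparable to the 295× figure above, which comes from the paper's own measurements.

```python
def total_tokens(n_segments: int, segment_len: int) -> int:
    """Total input length when every segment is full."""
    return n_segments * segment_len


def attention_pairs_full(n_tokens: int) -> int:
    """Token pairs attended when full attention spans the whole input (quadratic)."""
    return n_tokens * n_tokens


def attention_pairs_segmented(n_tokens: int, segment_len: int, memory_len: int = 0) -> int:
    """Token pairs attended when full attention is applied only within each segment.

    memory_len is a hypothetical number of memory tokens added to every segment.
    """
    n_segments = -(-n_tokens // segment_len)  # ceiling division
    per_segment = (segment_len + memory_len) ** 2
    return n_segments * per_segment


if __name__ == "__main__":
    n = total_tokens(4096, 512)
    print(n)  # 2097152
    # Doubling the input doubles the segmented cost but quadruples the full cost.
    print(attention_pairs_segmented(2 * n, 512) // attention_pairs_segmented(n, 512))  # 2
    print(attention_pairs_full(2 * n) // attention_pairs_full(n))  # 4
```

Under these assumptions, 4,096 segments of 512 tokens give 2,097,152 tokens, which matches the "2,000,000 tokens" figure after rounding, and the segmented count grows in proportion to the number of segments.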

This review was generated by AI and reviewed by a human editor.