QUICK REVIEW

[Paper Review] Ring Attention with Blockwise Transformers for Near-Infinite Context

Hao Liu, Matei Zaharia|arXiv (Cornell University)|Oct 3, 2023

Topic Modeling | Cited by 12

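ONE-SENTENCE SUMMARY

Ring Attention overlaps blockwise attention and feedforward computation with cross-device communication in a ring topology, so context length scales with the number of devices and near-infinite context becomes possible without approximating attention.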

ABSTRACT

Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands imposed by Transformers limit their ability to handle long sequences, thereby posing challenges in utilizing videos, actions, and other long-form sequences and modalities in complex environments. We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices while fully overlapping the communication of key-value blocks with the computation of blockwise attention. Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers, without resorting to approximations or incurring additional communication and computation overheads. Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of our approach in allowing millions of tokens context size and improving performance.

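MOTIVATION AND OBJECTIVES

Motivate and address the memory bottleneck of long-context Transformers.
Propose a ring-based blockwise computation scheme that distributes long sequences across devices.
Show that overlapping key-value block communication with computation eliminates the communication overhead.
Demonstrate scaling to millions of tokens, with context length growing with device count, on language modeling and reinforcement learning tasks.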

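PROPOSED METHOD

Blockwise attention and feedforward computation split the sequence dimension across multiple devices.
Hosts are arranged in a ring; each host keeps one query block, sends its key-value block to the next host, and receives a key-value block from the previous host.
Key-value block communication is overlapped with blockwise computation to hide communication latency.
The Blockwise Parallel Transformer makes memory cost linear in the block size and independent of the total sequence length.
Algorithm 1 outlines the steps for reducing memory cost in Transformer training with FSDP and ring communication.
The implementation builds on memory-efficient attention primitives and exact blockwise operations, with no approximation.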
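
The rotation scheme described above can be sketched as a single-process simulation. This is a minimal illustration, not the authors' implementation: the names `full_attention` and `ring_attention` are hypothetical, attention is bidirectional (no causal mask), and the "devices" are plain list entries processed one after another, so the sketch shows why the result is exact but does not show the overlap of communication with computation. Each simulated device keeps a running maximum, normalizer, and unnormalized output per query row (the usual online-softmax bookkeeping), which lets key-value blocks arrive in any order.

```python
import math
import random


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v over the whole sequence."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [_dot(qi, kj) / math.sqrt(d) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z
                    for c in range(len(v[0]))])
    return out


def ring_attention(q, k, v, n_devices):
    """Simulated ring: device i holds query block i; KV blocks rotate."""
    n, d, dv = len(q), len(q[0]), len(v[0])
    assert n % n_devices == 0, "sequence must split evenly (assumption)"
    blk = n // n_devices
    q_blocks = [q[i * blk:(i + 1) * blk] for i in range(n_devices)]
    kv_blocks = [(k[i * blk:(i + 1) * blk], v[i * blk:(i + 1) * blk])
                 for i in range(n_devices)]
    # Per-device, per-query-row running state for the online softmax.
    run_max = [[-math.inf] * blk for _ in range(n_devices)]
    run_sum = [[0.0] * blk for _ in range(n_devices)]
    acc = [[[0.0] * dv for _ in range(blk)] for _ in range(n_devices)]

    for _ in range(n_devices):  # after n_devices steps every block was seen
        for dev in range(n_devices):
            kb, vb = kv_blocks[dev]
            for r, qi in enumerate(q_blocks[dev]):
                scores = [_dot(qi, kj) / math.sqrt(d) for kj in kb]
                new_max = max(run_max[dev][r], max(scores))
                scale = math.exp(run_max[dev][r] - new_max)  # rescale old state
                w = [math.exp(s - new_max) for s in scores]
                run_sum[dev][r] = run_sum[dev][r] * scale + sum(w)
                acc[dev][r] = [
                    a * scale + sum(wj * vj[c] for wj, vj in zip(w, vb))
                    for c, a in enumerate(acc[dev][r])
                ]
                run_max[dev][r] = new_max
        # Rotate: each device receives the block held by its predecessor.
        kv_blocks = [kv_blocks[(dev - 1) % n_devices]
                     for dev in range(n_devices)]

    return [[a / run_sum[dev][r] for a in acc[dev][r]]
            for dev in range(n_devices) for r in range(blk)]


def _rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
Q, K, V = (_rand_matrix(rng, 8, 4) for _ in range(3))
ref = full_attention(Q, K, V)
ring = ring_attention(Q, K, V, n_devices=4)
max_err = max(abs(a - b) for ra, rb in zip(ref, ring) for a, b in zip(ra, rb))
print("max abs difference vs. full attention:", max_err)
```

In a real multi-device setting the inner loop over `dev` would run in parallel and the rotation would be a collective send/receive issued while the current block is still being processed; the sequential loop here only checks the arithmetic.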

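EXPERIMENTAL RESULTS

Research Questions

RQ1: Can Ring Attention scale a Transformer's context length linearly with the number of devices while maintaining performance?
RQ2: What are the memory and compute trade-offs when blockwise attention is distributed across a ring of devices?
RQ3: How does Ring Attention affect model FLOPs utilization and throughput across model sizes and hardware (GPU/TPU)?
RQ4: Does Ring Attention improve downstream tasks that benefit from long context, such as reinforcement learning and long-context language modeling?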

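Key Findings

Ring Attention enables training on sequences up to device-count times longer than prior memory-efficient methods allow.
Context length can exceed millions of tokens without approximation or additional overhead.
MFU (model FLOPs utilization) remains high even at very long context lengths, unlike some baselines.
In reinforcement learning experiments on ExoRL, Ring Attention improves average return over baselines on multiple tasks when longer trajectories/contexts are used.
Fine-tuning LLaMA-13B with Ring Attention at a 512K-token context maintains high accuracy on a long-context line retrieval task, outperforming several short-context baselines.
On hardware (A100 GPUs and TPUs), Ring Attention scales to much longer context lengths than vanilla and memory-efficient Transformers, with small overhead.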
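
The scaling claims above reduce to simple arithmetic, sketched here as a back-of-envelope check. The function names and every number in the usage lines are hypothetical. The overlap test also relies on an assumed cost model that this summary does not state: roughly 4·d·c² FLOPs to process one query block against one key-value block, and roughly 4·c·d bytes to send one key-value block, for block size c and hidden dimension d.

```python
def max_context_tokens(tokens_per_device: int, n_devices: int) -> int:
    """Total context when each device holds one block of the sequence."""
    return tokens_per_device * n_devices


def comm_is_hidden(block_tokens: int, hidden_dim: int,
                   flops_per_s: float, bytes_per_s: float) -> bool:
    """True if sending one KV block takes no longer than computing on one.

    Assumed cost model (not taken from the summary):
      compute ~ 4 * d * c^2 FLOPs, transfer ~ 4 * c * d bytes.
    Under this model the condition simplifies to c >= flops_per_s / bytes_per_s.
    """
    compute_s = 4 * hidden_dim * block_tokens ** 2 / flops_per_s
    comm_s = 4 * block_tokens * hidden_dim / bytes_per_s
    return compute_s >= comm_s


# Hypothetical cluster: 32 devices, 32K tokens per device.
total = max_context_tokens(32_768, 32)

# Hypothetical hardware: 1e14 FLOP/s per device, 1e11 bytes/s interconnect.
large_block_ok = comm_is_hidden(2048, 4096, 1e14, 1e11)
small_block_ok = comm_is_hidden(512, 4096, 1e14, 1e11)
print(total, large_block_ok, small_block_ok)
```

Under these assumptions, a block that is too small finishes its computation before the next key-value block arrives, so the transfer can no longer be hidden; larger blocks give the interconnect more time per step.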

This review was generated by AI and reviewed by a human editor.