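[Paper Digest] Long Range Arena: A Benchmark for Efficient Transformers
This paper proposes Long Range Arena (LRA), a unified benchmark for evaluating efficient Transformers on long-context tasks spanning 1K to 16K tokens. It compares ten models across diverse data types and tasks, analyzes accuracy, speed, and memory to expose the trade-offs, and concludes that no single model is best across the board.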
Transformers do not scale very well to long sequence lengths largely because of quadratic self-attention complexity. In the recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem, more often than not claiming superior or comparable model quality to vanilla Transformer models. To this date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide spectrum of tasks and datasets makes it difficult to assess relative model quality amongst many models. This paper proposes a systematic and unified benchmark, LRA, specifically focused on evaluating model quality under long-context scenarios. Our benchmark is a suite of tasks consisting of sequences ranging from $1K$ to $16K$ tokens, encompassing a wide range of data types and modalities such as text, natural, synthetic images, and mathematical expressions requiring similarity, structural, and visual-spatial reasoning. We systematically evaluate ten well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers, and Longformers) on our newly proposed benchmark suite. LRA paves the way towards better understanding this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle. Our benchmark code will be released at https://github.com/google-research/long-range-arena.
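To make the quadratic cost mentioned in the abstract concrete, the following minimal sketch estimates the memory needed just for the attention score matrices of one layer. The head count, batch size, and float32 precision are illustrative assumptions, not settings taken from the paper.

```python
# Rough memory estimate for dense self-attention score matrices.
# Assumptions (not from the paper): float32 scores, one n x n matrix per head,
# and hypothetical default head/batch counts chosen only for illustration.

BYTES_PER_FLOAT32 = 4


def attention_score_bytes(seq_len: int, num_heads: int = 8, batch_size: int = 1) -> int:
    """Bytes needed to hold the (seq_len x seq_len) score matrices of one layer."""
    return batch_size * num_heads * seq_len * seq_len * BYTES_PER_FLOAT32


def growth_factor(short_len: int, long_len: int) -> float:
    """How much score-matrix memory grows when the sequence gets longer."""
    return attention_score_bytes(long_len) / attention_score_bytes(short_len)


if __name__ == "__main__":
    for n in (1024, 4096, 16384):
        mib = attention_score_bytes(n) / 2**20
        print(f"n={n:>6}: {mib:,.0f} MiB of attention scores per layer")
    print("1K -> 16K growth:", growth_factor(1024, 16384))
```

Under these assumptions, going from 1K to 16K tokens multiplies the score-matrix memory by 256, which is the scaling behavior that efficient attention variants try to avoid.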
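Motivation and Objectives
- Establish a unified, general-purpose benchmark for long-range Transformer models that covers multiple data modalities.
- Evaluate a broad range of efficient Transformer architectures under long-context conditions.
- Provide a thorough efficiency analysis (speed and memory) to guide model selection and future research.
Proposed Method
- Design a suite of long-context tasks: ListOps, byte-level text classification, byte-level document retrieval, image classification on sequences of pixels, Pathfinder, and Pathfinder-X (Path-X).
- Evaluate efficient Transformer models on these tasks (Reformer, Linformer, Linear Transformer, Sparse Transformer, Longformer, Sinkhorn Transformer, Synthesizer, BigBird, Performer, and a Local Attention baseline) alongside the vanilla Transformer.
- Quantify the attention span each task requires, and report per-task and overall performance.
- Release open-source benchmark code built on JAX/Flax so that results can be reproduced and extended.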
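The byte-level text tasks listed above operate on raw bytes rather than words or subwords, which is what pushes ordinary documents into the multi-thousand-token range. The sketch below shows one plausible way to do this; the +1 shift that reserves ID 0 for padding, the fixed `max_len`, and the function name are assumptions for illustration and are not taken from the benchmark code.

```python
from typing import List

# Byte-level encoding sketch. Assumptions (hypothetical, for illustration):
# token IDs are UTF-8 byte values shifted by +1 so that 0 can serve as padding,
# and every sequence is truncated or right-padded to a fixed max_len.

PAD_ID = 0


def encode_bytes(text: str, max_len: int) -> List[int]:
    """Map text to a fixed-length list of byte-level token IDs."""
    ids = [b + 1 for b in text.encode("utf-8")][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


if __name__ == "__main__":
    print(encode_bytes("hi", 4))
    print(len(encode_bytes("a long review " * 1000, 4096)))
```

Because every character costs at least one token, a review of a few thousand characters already fills a 4K-token input under this scheme.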
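Experimental Results
Research Questions
- RQ1: How do different efficient Transformer architectures perform on long-range tasks over text, images, and synthetic data?
- RQ2: What speed and memory trade-offs do these architectures make at long sequence lengths?
- RQ3: Is there a single model that performs consistently well across all long-range tasks, or do trade-offs dominate?
- RQ4: How does increasing the sequence length (as in Pathfinder-X) affect each model's ability to learn?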
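RQ4 turns on how quickly sequence length grows when an image is fed to the model pixel by pixel. The sketch below flattens a grayscale image in row-major order; the 32×32 and 128×128 resolutions are assumptions chosen to match the 1K and 16K sequence lengths quoted for the benchmark, and the function names are hypothetical.

```python
from typing import List


def flatten_image(pixels: List[List[int]]) -> List[int]:
    """Row-major flattening of an H x W grayscale image into a length H*W sequence."""
    return [value for row in pixels for value in row]


def sequence_length(height: int, width: int) -> int:
    """Number of tokens produced when each pixel becomes one token."""
    return height * width


if __name__ == "__main__":
    print(flatten_image([[1, 2], [3, 4]]))
    print(sequence_length(32, 32))    # assumed Pathfinder-style resolution
    print(sequence_length(128, 128))  # assumed Path-X-style resolution
```

Quadrupling each side of the image multiplies the sequence length by 16, so a task that is learnable at 1K tokens can become much harder at 16K even though the underlying problem is unchanged.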
Key Findings
| Model | ListOps | Text | Retrieval | Image | Pathfinder | Path-X | Avg |
|---|---|---|---|---|---|---|---|
| Transformer | 36.37 | 64.27 | 57.46 | 42.44 | 71.40 | FAIL | 54.39 |
| Local Attention | 15.82 | 52.98 | 53.39 | 41.46 | 66.63 | FAIL | 46.06 |
| Sparse Trans. | 17.07 | 63.58 | 59.59 | 44.24 | 71.71 | FAIL | 51.24 |
| Longformer | 35.63 | 62.85 | 56.89 | 42.22 | 69.71 | FAIL | 53.46 |
| Linformer | 35.70 | 53.94 | 52.27 | 38.56 | 76.34 | FAIL | 51.36 |
| Reformer | 37.27 | 56.10 | 53.40 | 38.07 | 68.50 | FAIL | 50.67 |
| Sinkhorn Trans. | 33.67 | 61.20 | 53.83 | 41.23 | 67.45 | FAIL | 51.39 |
| Synthesizer | 36.99 | 61.68 | 54.67 | 41.61 | 69.45 | FAIL | 52.88 |
| BigBird | 36.05 | 64.02 | 59.29 | 40.83 | 74.87 | FAIL | 55.01 |
| Linear Trans. | 16.13 | 65.90 | 53.09 | 42.34 | 75.30 | FAIL | 50.55 |
| Performer | 18.01 | 65.40 | 53.82 | 42.77 | 77.05 | FAIL | 51.41 |
| Task Avg (Std) | 29 (9.7) | 61 (4.6) | 55 (2.6) | 41 (1.8) | 72 (3.7) | FAIL | 52 (2.4) |
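- All LRA tasks are challenging for current models, and several tasks leave a large gap to the best achievable performance.
- BigBird achieves the best overall LRA score (55.01) through balanced performance across tasks, even though it does not rank first on any individual task.
- Kernel-based variants such as Performer and Linear Transformer offer strong speed/memory trade-offs, sometimes at the cost of task-specific accuracy (both score below 19 on ListOps, for example).
- No model solves the extreme-length Path-X task: every model in the table is reported as FAIL, which highlights the limits of current architectures on very long sequences.
- There is no one-size-fits-all solution; the trade-offs among accuracy, speed, and memory vary by task and by model.
- Memory footprints differ widely: at a sequence length of 4K, Linformer needs just under 1 GB of device memory, whereas the vanilla Transformer needs about 9.48 GB.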
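The Avg column in the results table can be re-derived from the five per-task scores, which also shows that Path-X (FAIL for every model) is left out of the average. The following is a minimal sketch with the numbers copied from the table; the helper names are hypothetical.

```python
# Per-task accuracy (%) copied from the results table, in the order
# ListOps, Text, Retrieval, Image, Pathfinder. Path-X is omitted because
# every model is reported as FAIL on it.
SCORES = {
    "Transformer":     [36.37, 64.27, 57.46, 42.44, 71.40],
    "Local Attention": [15.82, 52.98, 53.39, 41.46, 66.63],
    "Sparse Trans.":   [17.07, 63.58, 59.59, 44.24, 71.71],
    "Longformer":      [35.63, 62.85, 56.89, 42.22, 69.71],
    "Linformer":       [35.70, 53.94, 52.27, 38.56, 76.34],
    "Reformer":        [37.27, 56.10, 53.40, 38.07, 68.50],
    "Sinkhorn Trans.": [33.67, 61.20, 53.83, 41.23, 67.45],
    "Synthesizer":     [36.99, 61.68, 54.67, 41.61, 69.45],
    "BigBird":         [36.05, 64.02, 59.29, 40.83, 74.87],
    "Linear Trans.":   [16.13, 65.90, 53.09, 42.34, 75.30],
    "Performer":       [18.01, 65.40, 53.82, 42.77, 77.05],
}


def average(values):
    """Unweighted mean of the per-task scores."""
    return sum(values) / len(values)


def rank_by_average(scores):
    """Model names sorted from highest to lowest average score."""
    return sorted(scores, key=lambda name: average(scores[name]), reverse=True)


if __name__ == "__main__":
    for name in rank_by_average(SCORES):
        print(f"{name:<16} {average(SCORES[name]):.2f}")
```

With these inputs, BigBird ranks first and the vanilla Transformer second. Recomputed this way, each row agrees with the listed average to two decimals except Sinkhorn Trans., whose five scores average to about 51.48 rather than the listed 51.39.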
This digest was generated by AI and reviewed by a human editor.