QUICK REVIEW

[Paper Review] RULER: What's the Real Context Size of Your Long-Context Language Models?

Cheng-Ping Hsieh, Simeng Sun|arXiv (Cornell University)|Apr 9, 2024

Topic Modeling | Cited by 9

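One-sentence summary

Key takeaway: RULER is a synthetic benchmark with four task categories that evaluates long-context language models beyond retrieval. Results show that most models degrade as context length grows, and only a few hold their performance at very long contexts.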

ABSTRACT

The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the "needle") from long distractor texts (the "haystack"), has been widely adopted to evaluate long-context language models (LMs). However, this simple retrieval-based test is indicative of only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create a new synthetic benchmark RULER with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces new task categories multi-hop tracing and aggregation to test behaviors beyond searching from context. We evaluate 17 long-context LMs with 13 representative tasks in RULER. Despite achieving nearly perfect accuracy in the vanilla NIAH test, almost all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K. Our analysis of Yi-34B, which supports context length of 200K, reveals large room for improvement as we increase input length and task complexity. We open source RULER to spur comprehensive evaluation of long-context LMs.

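Motivation and goals

Push for a more comprehensive evaluation of long-context language models beyond retrieval tasks.
Provide a flexible benchmark (RULER) that tests models across different context lengths and task complexities.
Study how current models handle retrieval, tracing, aggregation, and question answering with long inputs.
Identify failure modes and the factors that hinder or enable long-context understanding.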

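Proposed method

Propose RULER, a synthetic benchmark with four task categories and configurable context length and complexity.
Extend needle-in-a-haystack (NIAH) into several retrieval variants (S-NIAH, MK-NIAH, MV-NIAH, MQ-NIAH).
Introduce multi-hop tracing (variable tracking) to test coreference-style chain resolution over long contexts.
Add aggregation tasks (CWE and FWE) to evaluate how well models integrate information spread across long inputs.
Include question answering with distracting information to evaluate long-context QA.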
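To make the retrieval and tracing task types above concrete, here is a minimal sketch of how such synthetic samples could be built. It is not the RULER implementation: the filler text, the needle and question templates, and the names `make_niah_sample`, `make_variable_tracking_sample`, and `recall_score` are all assumptions made for illustration.

```python
import random

# Hypothetical filler text; any distractor text could serve as the haystack.
FILLER = "The grass is green. The sky is blue. The sun is yellow."


def make_niah_sample(num_filler, num_needles=1, seed=0):
    """Hide key-value needles in filler text and ask for one of them.

    With num_needles > 1, the extra needles act as distractors, loosely
    resembling a multi-key retrieval setup (an assumption of this sketch).
    Returns (prompt, query_key, expected_value).
    """
    rng = random.Random(seed)
    needles = {f"key-{i}": str(rng.randint(1000000, 9999999)) for i in range(num_needles)}
    sentences = [FILLER] * num_filler
    for key, value in needles.items():
        # Insert each needle at a random position in the haystack.
        sentences.insert(rng.randint(0, len(sentences)),
                         f"The special magic number for {key} is {value}.")
    query_key = rng.choice(sorted(needles))
    question = f"What is the special magic number for {query_key}?"
    return " ".join(sentences) + "\n" + question, query_key, needles[query_key]


def make_variable_tracking_sample(num_hops, num_filler, seed=0):
    """Scatter a chain X0 = value, X1 = X0, ... through filler text.

    Returns (prompt, value, names); every name in `names` resolves to `value`.
    """
    rng = random.Random(seed)
    value = str(rng.randint(10000, 99999))
    names = [f"X{i}" for i in range(num_hops + 1)]
    statements = [f"VAR {names[0]} = {value}."]
    statements += [f"VAR {names[i]} = VAR {names[i - 1]}." for i in range(1, num_hops + 1)]
    sentences = [FILLER] * num_filler
    # Sorted positions keep the assignments in chain order within the text.
    positions = sorted(rng.randint(0, num_filler) for _ in statements)
    for offset, (pos, stmt) in enumerate(zip(positions, statements)):
        sentences.insert(pos + offset, stmt)
    question = f"Find all variables that are assigned the value {value}."
    return " ".join(sentences) + "\n" + question, value, names


def recall_score(model_output, expected_items):
    """Fraction of expected items found in the output (naive substring match)."""
    hits = sum(1 for item in expected_items if item in model_output)
    return hits / len(expected_items)
```

Raising `num_filler` lengthens the context, while `num_needles` and `num_hops` raise task complexity, which mirrors the two axes the benchmark is described as varying.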

Figure 1: In aggregation tasks, we sample words from a vocabulary following the two distributions above. The common words extraction (CWE) samples from uniform distributions. In the frequent words extraction (FWE), the frequency of each word is determined by its rank in the vocabulary and the parame
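The two sampling schemes in the caption can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's code: the CWE variant is simplified to fixed repeat counts for a "common" and an "uncommon" group, and the FWE variant assumes a power law in which the word at rank r has weight 1 / r^alpha. The function names and the `alpha` parameter are hypothetical.

```python
import random
from collections import Counter


def make_cwe_words(common, uncommon, freq_common, freq_uncommon, seed=0):
    """Common-words sketch: each group has a uniform per-word frequency.

    Simplifying assumption: common words repeat freq_common times each,
    uncommon words freq_uncommon times each, then the list is shuffled.
    """
    rng = random.Random(seed)
    words = [w for w in common for _ in range(freq_common)]
    words += [w for w in uncommon for _ in range(freq_uncommon)]
    rng.shuffle(words)
    return words


def make_fwe_words(vocab, num_words, alpha=2.0, seed=0):
    """Frequent-words sketch: sampling weight depends on vocabulary rank.

    Assumed power law: the word at rank r (1-based) has weight 1 / r**alpha.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** alpha) for rank in range(1, len(vocab) + 1)]
    return rng.choices(vocab, weights=weights, k=num_words)


def top_k(words, k):
    """Reference answer: the k most frequent words in the sampled list."""
    return [w for w, _ in Counter(words).most_common(k)]
```

In this sketch, a smaller `alpha` flattens the distribution, so the most frequent words stand out less and the aggregation task gets harder.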

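Experimental results

Research questions

RQ1: How do long-context language models perform on retrieval tasks as context length grows and distractors vary?
RQ2: Can models reliably perform multi-hop tracing and entity tracking over long contexts?
RQ3: Can models effectively aggregate information spread across long sequences, and how does a Zipf-like word distribution affect this?
RQ4: When distractors are added, how does context size affect QA performance, and do models hallucinate or fall back on parametric knowledge?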

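Main findings

Most models degrade significantly as context length increases, despite performing well on vanilla NIAH.
Only some models (e.g., GPT-4, Command-R, Yi-34B, Mixtral) maintain satisfactory performance at a context length of 32K, and many models fall short of their claimed context size.
Non-retrieval tasks (multi-hop tracing, aggregation, QA) expose notable failure modes, such as copying from the context, relying on parametric knowledge, and incomplete retrieval.
Model size, training context length, and architecture all affect long-context capability: larger models generally perform better, and non-Transformer architectures lag behind.
Yi-34B-200K degrades markedly as input length and task complexity increase, with failures including incomplete answers and difficulty locating relevant information.
Training on longer contexts does not uniformly improve RULER performance, and performance drops sharply when extrapolating to unseen lengths.
The benchmark surfaces failure modes that vanilla NIAH cannot capture, underscoring the need for broader long-context evaluation.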
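One way to turn the gap between claimed and usable context size into a number is to find the longest evaluated length at which a model still meets a "satisfactory" score. The sketch below assumes a caller-supplied threshold and requires every shorter length to pass as well. The function name, the threshold, and the scores in the usage note are hypothetical and are not results from the paper.

```python
def effective_length(scores_by_length, threshold):
    """Return the longest length whose score, and every shorter one, meets the threshold.

    scores_by_length maps context length (in tokens) to a benchmark score.
    Returns None when even the shortest evaluated length falls below the threshold.
    """
    effective = None
    for length in sorted(scores_by_length):
        if scores_by_length[length] < threshold:
            break  # stop at the first failing length
        effective = length
    return effective
```

For example, with made-up scores `{4000: 95.0, 8000: 92.0, 16000: 86.0, 32000: 80.0, 64000: 87.0}` and a threshold of `85.0`, the function returns `16000`: the recovery at 64000 is ignored because the model already failed at 32000.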

Figure 5: Correlation heatmap among 18 tasks with diverse task configurations. We remove redundant tasks (in red) and only preserve 13 representative tasks in RULER. (W: words; N: numbers; U: UUIDs; Full: entire haystack)
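The caption does not say how redundancy was decided. As one plausible illustration only, the sketch below computes the Pearson correlation between tasks from per-model scores and greedily keeps a task only if it is not highly correlated with a task already kept. The threshold, the greedy order, and the names `pearson` and `prune_redundant_tasks` are assumptions of this sketch, not the authors' procedure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists (assumes neither is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def prune_redundant_tasks(task_scores, threshold=0.9):
    """Greedily keep tasks whose correlation with every kept task is at most threshold.

    task_scores maps task name -> list of scores, one per model, in a fixed model order.
    Tasks are visited in dictionary order, so earlier tasks win ties.
    """
    kept = []
    for name, scores in task_scores.items():
        if all(pearson(scores, task_scores[other]) <= threshold for other in kept):
            kept.append(name)
    return kept
```

With hypothetical scores such as `{"A": [90, 80, 70, 60], "B": [88, 79, 69, 61], "C": [50, 70, 40, 65]}`, task B tracks task A almost perfectly and is dropped, leaving `["A", "C"]`.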


This review was generated by AI and checked by human editors.