QUICK REVIEW

[Paper Review] LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning

Hongye Jin, Xiaotian Han|arXiv (Cornell University)|Jan 2, 2024

Handwritten Text Recognition Techniques | Cited by 13

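One-Sentence Summary

SelfExtend extends an LLM's context window at inference time, without fine-tuning, by combining grouped attention for distant tokens with standard neighbor attention for nearby tokens, using a floor-based relative-position mapping.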

ABSTRACT

It is well known that LLMs cannot generalize well to long contexts whose lengths are larger than the training sequence length. This poses challenges when employing LLMs for processing long input sequences during inference. In this work, we argue that LLMs themselves have inherent capabilities to handle long contexts without fine-tuning. To achieve this goal, we propose SelfExtend to extend the context window of LLMs by constructing bi-level attention information: the grouped attention and the neighbor attention. The grouped attention captures the dependencies among tokens that are far apart, while neighbor attention captures dependencies among adjacent tokens within a specified range. The two-level attentions are computed based on the original model's self-attention mechanism during inference. With minor code modification, our SelfExtend can effortlessly extend existing LLMs' context window without any fine-tuning. We conduct comprehensive experiments on multiple benchmarks and the results show that our SelfExtend can effectively extend existing LLMs' context window length. The code can be found at https://github.com/datamllab/LongLM.

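Motivation and Objectives

Argue that LLMs have an inherent ability to handle long contexts, despite the limits of pretraining.
Address the out-of-distribution (O.O.D.) problem of relative positional encodings during long-context inference.
Propose a mechanism that extends context length at inference time without fine-tuning.
Evaluate SelfExtend on language modeling, synthetic long-context, and real-world long-context tasks.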

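Proposed Method

Introduce grouped attention, which applies a floor-division position mapping to distant tokens.
Keep standard attention for neighboring tokens within a defined window.
Merge grouped attention and neighbor attention before the softmax to form SelfExtend attention.
Provide a gradient-free, plug-in inference-time modification that requires no fine-tuning.
Derive a formula for the extended context length that quantifies the maximum length reachable under SelfExtend.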
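
The position mapping and the length formula listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `self_extend_relpos` and `max_extended_length` and their parameters are hypothetical, the shift term `neighbor_window - neighbor_window // group_size` is assumed so that the grouped region continues right where the neighbor window ends, and the length formula `(L - w_n) * G + w_n` takes `L` as the pretraining window, `w_n` as the neighbor window, and `G` as the group size.

```python
def self_extend_relpos(q_pos: int, k_pos: int, group_size: int, neighbor_window: int) -> int:
    """Relative position assigned to a (query, key) pair under a SelfExtend-style mapping.

    Hypothetical helper for illustration only.
    """
    if k_pos > q_pos:
        raise ValueError("causal attention: the key must not come after the query")
    distance = q_pos - k_pos
    # Neighbor attention: nearby tokens keep their exact relative position.
    if distance < neighbor_window:
        return distance
    # Grouped attention: floor-divide both positions by the group size, then
    # shift (assumed form) so the grouped range starts at the window boundary.
    shift = neighbor_window - neighbor_window // group_size
    return q_pos // group_size + shift - k_pos // group_size


def max_extended_length(pretrain_len: int, group_size: int, neighbor_window: int) -> int:
    """Longest sequence whose mapped relative positions stay below pretrain_len."""
    return (pretrain_len - neighbor_window) * group_size + neighbor_window


if __name__ == "__main__":
    # Toy setting: group size 2, neighbor window 4, query at position 9.
    print([self_extend_relpos(9, k, 2, 4) for k in range(10)])
    # -> [6, 6, 5, 5, 4, 4, 3, 2, 1, 0]

    # Illustrative setting (values chosen for the example): 4096-token
    # pretraining window, group size 8, neighbor window 1024.
    print(max_extended_length(4096, 8, 1024))  # -> 25600
```

In the toy run, the four nearest keys keep exact distances 0 to 3, while more distant keys share a position in pairs, so a query at position 9 never sees a relative position larger than 6. A real implementation would apply the same rule inside attention by computing two sets of scores (one per position scheme) and selecting between them with a mask before the softmax.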

Figure 1: Illustration of grouped attention. We suppose that the LLM’s pretraining context window length is $5$ and the length of the inference sequence is $8$ . On the left figure, we show the positional Out-of-Distribution (O.O.D.) issue while the input length is out of the pretraining context win

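Experimental Results

Research Questions

RQ1: Can LLMs inherently handle contexts longer than those seen in pretraining, without fine-tuning?
RQ2: How can unseen large relative positions be mapped to known positions while preserving coherence?
RQ3: Does SelfExtend improve long-context performance across multiple models and tasks without degrading short-context performance?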

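Key Findings

With grouped attention, SelfExtend keeps perplexity low beyond the pretraining context window.
The passkey retrieval task shows 100% accuracy across different depths and context lengths, indicating genuine long-context access.
On real-world long-context benchmarks, SelfExtend is competitive with, or even superior to, fine-tuning-based extension methods.
SelfExtend improves long-context ability while preserving short-context task performance, allowing training-free plug-in use.
Experiments on diverse models (Llama-2, Mistral, SOLAR, Phi-2) show the method's broad applicability.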
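
For readers unfamiliar with passkey retrieval, the sketch below shows one way such a test can be built and scored. It is an assumption-laden illustration, not the paper's evaluation harness: the helper names `build_passkey_prompt` and `passkey_accuracy`, the filler sentence, and the question wording are all hypothetical. The idea is to hide a random 5-digit number at a chosen relative depth inside repetitive filler text and check whether the model's answer contains it.

```python
import random


def build_passkey_prompt(n_filler: int, depth: float, rng: random.Random) -> tuple[str, str]:
    """Return (prompt, passkey); depth in [0, 1] sets where the passkey sentence sits.

    Hypothetical helper; the filler and question wording are assumptions.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    passkey = str(rng.randint(10000, 99999))  # random 5-digit number
    filler = "The grass is green. The sky is blue. The sun is yellow."
    needle = f"The pass key is {passkey}. Remember it."
    sentences = [filler] * n_filler
    # depth 0.0 puts the passkey first, depth 1.0 puts it right before the question.
    sentences.insert(int(depth * n_filler), needle)
    prompt = " ".join(sentences) + " What is the pass key?"
    return prompt, passkey


def passkey_accuracy(answers: list[str], passkeys: list[str]) -> float:
    """Fraction of model answers that contain the correct passkey."""
    hits = sum(pk in ans for ans, pk in zip(answers, passkeys))
    return hits / len(passkeys)


if __name__ == "__main__":
    rng = random.Random(0)
    prompt, key = build_passkey_prompt(n_filler=200, depth=0.5, rng=rng)
    print(len(prompt.split()), "words; passkey hidden at the midpoint")
```

Sweeping `depth` and `n_filler` over a grid and calling `passkey_accuracy` on the model's answers for each cell gives the kind of depth-by-length accuracy picture that the finding above summarizes.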

Figure 2: Perplexity (PPL) using grouped attention with different group sizes under different sequence lengths on PG-19 dataset. The original Llama-2-7b-chat PPL is stable at 4k (4096) sequences (red dotted line) but explodes at 6k (6144) sequences (purple dotted line). The results show the LLMs kee


This review was generated by AI and reviewed by a human editor.