QUICK REVIEW

[Paper Review] Retrieval meets Long Context Large Language Models

Peng Xu, Wei Ping|arXiv (Cornell University)|Oct 4, 2023

Topic Modeling | Cited by 14

One-sentence summary

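This paper compares retrieval augmentation with long-context LLMs (4K, 16K, 32K) on long-context tasks. Retrieval improves both short-context and long-context models, and a 4K model with retrieval reaches performance comparable to a 16K model while using much less computation. The paper also presents a strong retrieval-augmented Llama2-70B-32k model that outperforms GPT-3.5-turbo-16k and Davinci003 in average score across nine long-context tasks.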

ABSTRACT

Extending the context window of large language models (LLMs) is getting popular recently, while the solution of augmenting LLMs with retrieval has existed for years. The natural questions are: i) Retrieval-augmentation versus long context window, which one is better for downstream tasks? ii) Can both methods be combined to get the best of both worlds? In this work, we answer these questions by studying both solutions using two state-of-the-art pretrained LLMs, i.e., a proprietary 43B GPT and Llama2-70B. Perhaps surprisingly, we find that LLM with 4K context window using simple retrieval-augmentation at generation can achieve comparable performance to finetuned LLM with 16K context window via positional interpolation on long context tasks, while taking much less computation. More importantly, we demonstrate that retrieval can significantly improve the performance of LLMs regardless of their extended context window sizes. Our best model, retrieval-augmented Llama2-70B with 32K context window, outperforms GPT-3.5-turbo-16k and Davinci003 in terms of average score on nine long context tasks including question answering, query-based summarization, and in-context few-shot learning tasks. It also outperforms its non-retrieval Llama2-70B-32k baseline by a margin, while being much faster at generation. Our study provides general insights on the choice of retrieval-augmentation versus long context extension of LLM for practitioners.

Motivation and objectives

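Assess whether retrieval augmentation improves performance on long-context tasks more than extending the context window does.
Quantify the effect of retrieval on models with different context windows (4K, 16K, 32K).
Assess whether combining retrieval with long context brings gains on question answering, query-based summarization, and in-context few-shot learning tasks.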

Proposed method

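Compare two decoder-only LLMs (GPT-43B, Llama2-70B) at context windows of 4K, 16K, and 32K.
Extend the context window to 16K/32K by applying positional interpolation to the RoPE embeddings.
Use three retrievers (Dragon, Contriever, OpenAI embeddings) to fetch the top-N chunks and feed them to the model as evidence.
Instruction-tune the models on a blend of instruction datasets so that they can follow prompts.
Evaluate zero-shot and few-shot tasks on seven long-context datasets: QMSum (QM), Qasper (QASP), NarrativeQA (NQA), QuALITY (QLTY), MuSiQue (MSQ), HotpotQA (HQA), and MultiFieldQA-en (MFQA).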
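The positional interpolation step listed above can be illustrated with a minimal sketch. It assumes the common RoPE formulation in which position m rotates dimension pair i by the angle m * base^(-2i/d), and it shrinks position indices by the ratio of trained length to target length so that extended positions fall back into the range seen during pretraining. The function names, the base of 10000, and the tiny head dimension are illustrative assumptions, not details taken from the paper.

```python
import math

def interpolation_scale(trained_len, target_len):
    """Factor by which positions are shrunk (assumed: trained_len / target_len)."""
    return min(1.0, trained_len / target_len)

def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angles for one token position under RoPE.

    scale=1.0 gives the usual angles; scale<1.0 applies position interpolation.
    base=10000.0 is the commonly used default and is an assumption here.
    """
    return [position * scale * base ** (-2 * i / head_dim)
            for i in range(head_dim // 2)]

# Extending a 4K model to 16K: positions are compressed by a factor of 4.
scale = interpolation_scale(4096, 16384)

# Position 16000 in the extended model gets the angles that position 4000
# had in the original model, so no angle leaves the pretrained range.
extended = rope_angles(16000, head_dim=8, scale=scale)
original = rope_angles(4000, head_dim=8)
print(all(math.isclose(x, y) for x, y in zip(extended, original)))
```

In this sketch only the angle computation changes; as the method list notes, the models are still finetuned after the context window is extended.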

Experimental results

Research questions

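RQ1: Compared with simply enlarging the context window, does retrieval augmentation improve the performance of long-context LLMs?
RQ2: Does a retrieval-augmented 4K-context model match or exceed 16K/32K-context models in accuracy and efficiency?
RQ3: How do context window size and the number of retrieved chunks affect downstream tasks across model sizes?
RQ4: How do different retrievers compare when used with long-context LLMs?
RQ5: Can retrieval-augmented long-context models outperform existing OpenAI models on long-context benchmarks?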

Key findings

Model	Seq. length	Avg.	QM	QASP	NQA	QLTY	MSQ	HQA	MFQA
GPT-43B	4k	26.44	15.56	23.66	15.64	49.35	11.08	28.91	40.90
GPT-43B + ret	4k	29.32	16.60	23.45	19.81	51.55	14.95	34.26	44.63
GPT-43B	16k	29.45	16.09	25.75	16.94	50.05	14.74	37.48	45.08
GPT-43B + ret	16k	29.65	15.69	23.82	21.11	47.90	15.52	36.14	47.39
Llama2-70B	4k	31.61	16.34	27.70	19.07	63.55	15.40	34.64	44.55
Llama2-70B + ret	4k	36.02	17.41	28.74	23.41	70.15	21.39	42.06	48.96
Llama2-70B	16k	36.78	16.72	30.92	22.32	76.10	18.78	43.97	48.63
Llama2-70B + ret	16k	37.23	18.70	29.54	23.12	70.90	23.28	44.81	50.24
Llama2-70B	32k	37.36	15.37	31.88	23.59	73.80	19.07	49.49	48.35
Llama2-70B + ret	32k	39.60	18.34	31.27	24.53	69.55	26.72	53.89	52.91
Llama2-7B	4k	22.65	14.25	22.07	14.38	40.90	8.66	23.13	35.20
Llama2-7B + ret	4k	26.04	16.45	22.97	18.18	43.25	14.68	26.62	40.10
Llama2-7B	32k	28.20	16.09	23.66	19.07	44.50	15.74	31.63	46.71
Llama2-7B + ret	32k	27.63	17.11	23.25	19.12	43.70	15.67	29.55	45.03
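The retrieval gain for each model and context length can be read off the average column. The short sketch below just subtracts the two averages per setting; the values are copied from the table above, while the dictionary layout and the function name are illustrative choices.

```python
# Average scores copied from the table above:
# (model, context) -> (without retrieval, with retrieval)
AVERAGES = {
    ("GPT-43B", "4k"): (26.44, 29.32),
    ("GPT-43B", "16k"): (29.45, 29.65),
    ("Llama2-70B", "4k"): (31.61, 36.02),
    ("Llama2-70B", "16k"): (36.78, 37.23),
    ("Llama2-70B", "32k"): (37.36, 39.60),
    ("Llama2-7B", "4k"): (22.65, 26.04),
    ("Llama2-7B", "32k"): (28.20, 27.63),
}

def retrieval_gain(model, context):
    """Average score with retrieval minus average score without it."""
    without_ret, with_ret = AVERAGES[(model, context)]
    return round(with_ret - without_ret, 2)

for model, context in AVERAGES:
    print(f"{model:<11} {context:>3}  {retrieval_gain(model, context):+.2f}")
```

On these numbers the gain is largest at 4K and small at 16K, and the only negative entry is Llama2-7B at 32K.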

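Retrieval improves the average score of both 4K and 16K/32K-context LLMs in most settings in the table above; the exception is Llama2-7B at 32K (27.63 with retrieval versus 28.20 without).
A 4K-context LLM with retrieval is comparable on average to a 16K long-context LLM (GPT-43B: 29.32 versus 29.45; Llama2-70B: 36.02 versus 36.78) while requiring far less computation.
The retrieval-augmented Llama2-70B-32k-ret (32K context) exceeds GPT-3.5-turbo-16k and Davinci-003 in average score on nine long-context tasks (for example, an average of 43.6 versus 42.8, with the baseline at 40.9).
Retrieval further improves long-context models: Llama2-70B-32k-ret has a higher average than its non-retrieval baseline (39.60 versus 37.36 in Table 3) and in some cases generates faster.
The benefit of retrieval is observed across several retrievers (Dragon, Contriever, OpenAI embeddings) and holds in both short-context and long-context settings.
Retrieving more than the top-5 chunks (top-10 or top-20) does not necessarily improve performance and can even reduce it, possibly because of the "lost in the middle" effect.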
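To make the top-N retrieval step concrete, here is a minimal sketch of chunking a long document, ranking chunks against the question, and placing the best N in front of the question. The bag-of-words `embed` function is a toy stand-in for a real retriever such as Dragon, Contriever, or OpenAI embeddings, and the chunk size, prompt template, and all function names are assumptions made for illustration rather than the paper's setup.

```python
import math
from collections import Counter

def chunk_words(text, chunk_size=50):
    """Split text into consecutive chunks of at most chunk_size words (size is an assumption)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def embed(text):
    """Toy bag-of-words vector; a real system would call a dense retriever here."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_top_n(question, chunks, n=5):
    """Return the n chunks most similar to the question, best first."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:n]

def build_prompt(question, chunks, n=5):
    """Put the retrieved evidence before the question (hypothetical template)."""
    evidence = "\n\n".join(retrieve_top_n(question, chunks, n))
    return f"{evidence}\n\nQuestion: {question}\nAnswer:"

chunks = [
    "the cat sat on the mat",
    "retrieval helps long context models",
    "bananas are yellow",
]
print(build_prompt("does retrieval help long context", chunks, n=1))
```

Keeping `n` as a parameter makes it easy to compare top-5 against top-10 or top-20 for a given model and context window.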


This review was generated by AI and reviewed by a human editor.