QUICK REVIEW

[Paper Review] Retrieval meets Long Context Large Language Models

Peng Xu, Wei Ping | arXiv (Cornell University) | 2023-10-04

Topic Modeling | Citations: 14

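One-line summary

This paper compares retrieval-augmented and long-context LLMs (4K, 16K, 32K) on long-context tasks. It shows that retrieval improves both short- and long-context models, and that a 4K model with retrieval can come close to 16K-level performance with less computation. It also shows that a retrieval-augmented Llama2-70B-32k model outperforms OpenAI API models (GPT-3.5-turbo-16k and Davinci003) on average across the evaluated tasks.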

ABSTRACT

Extending the context window of large language models (LLMs) is getting popular recently, while the solution of augmenting LLMs with retrieval has existed for years. The natural questions are: i) Retrieval-augmentation versus long context window, which one is better for downstream tasks? ii) Can both methods be combined to get the best of both worlds? In this work, we answer these questions by studying both solutions using two state-of-the-art pretrained LLMs, i.e., a proprietary 43B GPT and Llama2-70B. Perhaps surprisingly, we find that LLM with 4K context window using simple retrieval-augmentation at generation can achieve comparable performance to finetuned LLM with 16K context window via positional interpolation on long context tasks, while taking much less computation. More importantly, we demonstrate that retrieval can significantly improve the performance of LLMs regardless of their extended context window sizes. Our best model, retrieval-augmented Llama2-70B with 32K context window, outperforms GPT-3.5-turbo-16k and Davinci003 in terms of average score on nine long context tasks including question answering, query-based summarization, and in-context few-shot learning tasks. It also outperforms its non-retrieval Llama2-70B-32k baseline by a margin, while being much faster at generation. Our study provides general insights on the choice of retrieval-augmentation versus long context extension of LLM for practitioners.

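Motivation and Objectives

Evaluate whether retrieval augmentation or long-context extension gives better performance on long-context tasks.
Quantify how retrieval affects models across context windows (4K, 16K, 32K).
Assess the gains from combining retrieval with long context on question answering, summarization, and in-context learning.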

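Proposed Method

Compares two decoder-only LLMs, a proprietary 43B GPT model (GPT-43B) and Llama2-70B, at 4K, 16K, and 32K context windows.
Extends the context window to 16K/32K via positional interpolation on RoPE embeddings.
Uses three retrievers (Dragon, Contriever, OpenAI embeddings) to retrieve the top-N chunks and passes them to the model as evidence.
Performs instruction tuning on a blend of instruction datasets so that the models can follow prompts.
Evaluates zero-shot and few-shot tasks on seven long-context datasets: QM (QMSum), QASP (Qasper), NQA (NarrativeQA), QLTY (QuALITY), MSQ (MuSiQue), HQA (HotpotQA), and MFQA (MultiFieldQA-en).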
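
The context-extension step above can be illustrated with a minimal sketch of positional interpolation for RoPE. It assumes the standard linear formulation, in which a position in the extended window is rescaled by trained length / extended length before the rotary angles are computed. The function names, the default base of 10000, and the 4K to 16K lengths in the demo are illustrative assumptions, not the authors' code.

```python
def rope_angles(position, head_dim, base=10000.0):
    """Rotation angles RoPE applies at `position`, one per pair of dimensions."""
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def interpolated_position(position, trained_len, extended_len):
    """Linearly rescale a position in the extended window into the trained range."""
    return position * trained_len / extended_len


def interpolated_rope_angles(position, head_dim, trained_len=4096, extended_len=16384):
    """RoPE angles after positional interpolation (hypothetical helper)."""
    scaled = interpolated_position(position, trained_len, extended_len)
    return rope_angles(scaled, head_dim)


# The last position of a 16K window is mapped back inside the 4K trained range.
last_position = 16384 - 1
print(interpolated_position(last_position, 4096, 16384))
```

Because every rescaled position stays below the trained length, the model only sees rotary angles in the range it was pretrained on, and a comparatively short finetuning run is then used to adapt it to the denser position spacing.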

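Experimental Results

Research Questions

RQ1 Does retrieval augmentation improve the performance of long-context LLMs compared with purely extending the context window?
RQ2 Can a retrieval-augmented 4K-context model match or exceed 16K/32K-context models in accuracy and efficiency?
RQ3 How do context window size and the number of retrieved chunks affect downstream tasks across model sizes?
RQ4 How do different retrievers compare when used with large-context LLMs?
RQ5 Can retrieval-augmented large-context models outperform existing OpenAI models on long-context benchmarks?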

Key Results

Model	Seq len.	Avg.	QM	QASP	NQA	QLTY	MSQ	HQA	MFQA
GPT-43B	4k	26.44	15.56	23.66	15.64	49.35	11.08	28.91	40.90
GPT-43B + ret	4k	29.32	16.60	23.45	19.81	51.55	14.95	34.26	44.63
GPT-43B	16k	29.45	16.09	25.75	16.94	50.05	14.74	37.48	45.08
GPT-43B + ret	16k	29.65	15.69	23.82	21.11	47.90	15.52	36.14	47.39
Llama2-70B	4k	31.61	16.34	27.70	19.07	63.55	15.40	34.64	44.55
Llama2-70B + ret	4k	36.02	17.41	28.74	23.41	70.15	21.39	42.06	48.96
Llama2-70B	16k	36.78	16.72	30.92	22.32	76.10	18.78	43.97	48.63
Llama2-70B + ret	16k	37.23	18.70	29.54	23.12	70.90	23.28	44.81	50.24
Llama2-70B	32k	37.36	15.37	31.88	23.59	73.80	19.07	49.49	48.35
Llama2-70B + ret	32k	39.60	18.34	31.27	24.53	69.55	26.72	53.89	52.91
Llama2-7B	4k	22.65	14.25	22.07	14.38	40.90	8.66	23.13	35.20
Llama2-7B + ret	4k	26.04	16.45	22.97	18.18	43.25	14.68	26.62	40.10
Llama2-7B	32k	28.20	16.09	23.66	19.07	44.50	15.74	31.63	46.71
Llama2-7B + ret	32k	27.63	17.11	23.25	19.12	43.70	15.67	29.55	45.03
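
As a sanity check on the table above, the short script below recomputes the Avg. column for four rows as the unweighted mean of the seven task scores, and then computes the change in Avg. when retrieval is added. All numbers are copied from the table; the dictionary layout and function names are assumptions made for illustration.

```python
# Seven task scores copied from four rows of the table above.
TASK_SCORES = {
    ("GPT-43B", "4k"): [15.56, 23.66, 15.64, 49.35, 11.08, 28.91, 40.90],
    ("Llama2-70B", "4k"): [16.34, 27.70, 19.07, 63.55, 15.40, 34.64, 44.55],
    ("Llama2-70B", "32k"): [15.37, 31.88, 23.59, 73.80, 19.07, 49.49, 48.35],
    ("Llama2-70B + ret", "32k"): [18.34, 31.27, 24.53, 69.55, 26.72, 53.89, 52.91],
}

# Reported Avg. values per model and sequence length:
# (without retrieval, with retrieval).
REPORTED_AVG = {
    ("GPT-43B", "4k"): (26.44, 29.32),
    ("GPT-43B", "16k"): (29.45, 29.65),
    ("Llama2-70B", "4k"): (31.61, 36.02),
    ("Llama2-70B", "16k"): (36.78, 37.23),
    ("Llama2-70B", "32k"): (37.36, 39.60),
    ("Llama2-7B", "4k"): (22.65, 26.04),
    ("Llama2-7B", "32k"): (28.20, 27.63),
}


def mean_score(scores):
    """Unweighted mean of the per-task scores."""
    return sum(scores) / len(scores)


def retrieval_gain(model, seq_len):
    """Change in reported Avg. when retrieval is added, rounded to two decimals."""
    base, with_ret = REPORTED_AVG[(model, seq_len)]
    return round(with_ret - base, 2)


for key, scores in TASK_SCORES.items():
    print(key, round(mean_score(scores), 2))

for model, seq_len in REPORTED_AVG:
    print(model, seq_len, f"{retrieval_gain(model, seq_len):+.2f}")
```

For the rows included, the recomputed means agree with the reported Avg. values to two decimals. The gains are largest at 4K and smaller at 16K, and the only negative change is for Llama2-7B at 32K.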

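Retrieval raises the average score of the 43B and 70B models at every context length (4K, 16K, 32K), though not on every individual task: for example, QLTY drops from 76.10 to 70.90 for Llama2-70B at 16K, and the Llama2-7B average at 32K falls from 28.20 to 27.63.
A 4K-context LLM with retrieval comes close to the average performance of a 16K long-context LLM with considerably less computation (GPT-43B: 29.32 vs 29.45; Llama2-70B: 36.02 vs 36.78).
The retrieval-augmented Llama2-70B-32k-ret (32K context) outperforms GPT-3.5-turbo-16k and Davinci-003 on average across nine long-context tasks (average score 43.6 vs 42.8 and 40.9, respectively). These nine-task averages are not directly comparable to the seven-task Avg. column in the table above.
Retrieval augmentation further improves long-context models: Llama2-70B-32k-ret achieves a higher average than its non-retrieval baseline (39.60 vs 37.36 in Table 3) while also being faster at generation.
The benefit of retrieval is observed across multiple retrievers, including Dragon, Contriever, and OpenAI embeddings, and persists in both short- and long-context settings.
Increasing the number of retrieved chunks from the top 5 to the top 10 or 20 does not consistently improve performance and can degrade it because of the lost-in-the-middle effect.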
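
To make the top-N setting in the last point concrete, here is a minimal sketch of chunk retrieval and prompt assembly. A toy bag-of-words cosine score stands in for the dense retrievers named above (Dragon, Contriever, OpenAI embeddings); the chunk size, function names, and prompt template are all assumptions for illustration, not the paper's pipeline.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, words_per_chunk=50):
    """Split text into consecutive chunks of at most `words_per_chunk` words."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]


def bow_vector(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a dense encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve_top_n(question, chunks, n=5):
    """Return the n chunks most similar to the question, best first."""
    query = bow_vector(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, bow_vector(c)), reverse=True)
    return ranked[:n]


def build_prompt(question, evidence):
    """Concatenate retrieved chunks as evidence ahead of the question."""
    context = "\n\n".join(evidence)
    return f"{context}\n\nQuestion: {question}\nAnswer:"


chunks = [
    "The cat sat on the mat.",
    "Paris is the capital of France.",
    "RoPE encodes positions with rotations.",
]
question = "What is the capital of France?"
print(build_prompt(question, retrieve_top_n(question, chunks, n=1)))
```

In this sketch, raising `n` simply lengthens the evidence section of the prompt, so more chunks cost more input tokens and push relevant passages toward the middle of the context, which is where the degradation noted above is reported to arise.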

This review was generated by AI and reviewed by a human editor.