QUICK REVIEW

[Paper Review] In-Context Retrieval-Augmented Language Models

Ori Ram, Yoav Levine | arXiv (Cornell University) | 2023-01-31

Topic Modeling | Citations: 12

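One-Line Summary

This paper shows that prepending retrieved documents to the input (In-Context RALM), using an off-the-shelf retriever such as BM25, yields large gains without modifying the LM, and that further gains come from LM-oriented reranking of the retrieved documents.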

ABSTRACT

Retrieval-Augmented Language Modeling (RALM) methods, which condition a language model (LM) on relevant documents from a grounding corpus during generation, were shown to significantly improve language modeling performance. In addition, they can mitigate the problem of factually inaccurate text generation and provide natural source attribution mechanism. Existing RALM approaches focus on modifying the LM architecture in order to facilitate the incorporation of external information, significantly complicating deployment. This paper considers a simple alternative, which we dub In-Context RALM: leaving the LM architecture unchanged and prepending grounding documents to the input, without any further training of the LM. We show that In-Context RALM that builds on off-the-shelf general purpose retrievers provides surprisingly large LM gains across model sizes and diverse corpora. We also demonstrate that the document retrieval and ranking mechanism can be specialized to the RALM setting to further boost performance. We conclude that In-Context RALM has considerable potential to increase the prevalence of LM grounding, particularly in settings where a pretrained LM must be used without modification or even via API access.

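Motivation and Objectives

Motivate grounding large language models (LMs) without changing their architecture or training.
Evaluate a simple in-context retrieval-augmented framework across diverse corpora and model sizes.
Investigate retrieval strategies and document reranking that maximize LM performance.
Demonstrate applicability to open-domain question answering and discuss deployment advantages.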

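Proposed Method

Propose In-Context RALM: prepend retrieved documents to the LM input without changing the LM weights.
Use a retrieval stride s to control how often retrieval is performed during generation.
Use a retrieval query length ℓ that restricts the retriever's query to part of the prefix (its last ℓ tokens).
Evaluate a range of open-source LMs (GPT-2, GPT-Neo/J, OPT, LLaMA) on five corpora (WikiText-103, RealNews, ArXiv, Stack Exchange, FreeLaw).
Compare a sparse retriever (BM25) with dense retrievers; in the zero-shot setting, BM25 often outperforms the neural retrievers.
Introduce two LM-oriented reranking methods that select the best document from the top-k retrieved: (a) zero-shot reranking using an LM, and (b) predictive reranking trained on domain data.
Evaluate open-domain QA performance on Natural Questions and TriviaQA.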
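The interplay of the stride s and the query length ℓ can be sketched as a small scheduling loop. This is a minimal illustration, not the authors' implementation: `in_context_ralm_schedule`, `overlap_retrieve`, and the two-document `CORPUS` are hypothetical names introduced here, and the word-overlap retriever is only a stand-in for BM25. The function returns, for each stride, the context a frozen LM would see (retrieved document followed by the prefix) and the chunk of tokens it would then predict.

```python
from typing import Callable, List, Sequence, Tuple


def in_context_ralm_schedule(
    tokens: Sequence[str],
    retrieve: Callable[[Sequence[str]], Sequence[str]],
    stride: int = 4,
    query_len: int = 32,
) -> List[Tuple[List[str], List[str]]]:
    """Return (lm_context, target_chunk) pairs for scoring `tokens`.

    Every `stride` tokens, a query is built from the last `query_len`
    tokens of the prefix, one document is retrieved, and that document is
    placed in front of the prefix. The LM itself is never modified.
    """
    steps = []
    for start in range(0, len(tokens), stride):
        prefix = list(tokens[:start])
        query = prefix[-query_len:]  # last `query_len` tokens of the prefix
        # At position 0 there is no prefix, so nothing can be retrieved yet.
        doc = list(retrieve(query)) if query else []
        target = list(tokens[start:start + stride])
        # Tokens inside `target` are still predicted one by one; they all
        # share the same retrieved document until the next stride.
        steps.append((doc + prefix, target))
    return steps


# Hypothetical toy corpus and retriever (a stand-in for BM25).
CORPUS = [
    "the eiffel tower is in paris".split(),
    "bm25 is a sparse lexical retriever".split(),
]


def overlap_retrieve(query: Sequence[str]) -> Sequence[str]:
    """Pick the document sharing the most distinct words with the query."""
    return max(CORPUS, key=lambda d: len(set(d) & set(query)))


text = "bm25 is a strong baseline for retrieval augmented language models".split()
steps = in_context_ralm_schedule(text, overlap_retrieve, stride=4, query_len=3)
for context, target in steps:
    print(len(context), "context tokens ->", target)
```

A smaller stride refreshes the retrieved document more often at the cost of more retrieval calls and more LM forward passes over a changed context, which is the trade-off the stride parameter controls.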

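Experimental Results

Research Questions

RQ1: How much can LM performance improve simply by prepending retrieved documents to the input of an off-the-shelf LM?
RQ2: Which retriever type and retrieval configuration (stride and query length) maximize in-context grounding?
RQ3: Does LM-oriented reranking provide additional gains beyond simple top-1 retrieval?
RQ4: How well does In-Context RALM transfer to open-domain QA tasks without LM modification or fine-tuning?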

Key Results

Model	Retrieval	WikiText-103 (word ppl)	RealNews (token ppl)	ArXiv (token ppl)	Stack Exchange (token ppl)	FreeLaw (token ppl)
GPT-2 S	–	37.5	21.3	12.0	12.8	13.0
GPT-2 S (BM25 § 5)	BM25	29.6	16.1	10.9	11.3	9.6
GPT-2 S (BM25, Zero-shot)	BM25	28.6	15.5	10.1	10.6	8.8
GPT-2 S (BM25, Predictive)	BM25	26.8	–	–	–	–
GPT-2 M	–	26.3	15.7	9.3	8.8	9.6
GPT-2 M (BM25)	BM25	21.5	12.4	8.6	8.1	7.4
GPT-2 M (BM25, Zero-shot)	BM25	20.8	12.0	8.0	7.7	6.9
GPT-2 L	–	22.0	13.6	8.4	8.0	8.0
GPT-2 L (BM25)	BM25	18.1	10.9	7.8	7.8	6.8
GPT-2 L (BM25, Zero-shot)	BM25	17.6	10.6	7.3	7.4	6.4
GPT-2 XL	–	20.0	12.4	7.8	8.0	8.0
GPT-2 XL (BM25)	BM25	16.6	10.1	7.2	7.4	6.4
GPT-2 XL (BM25, Zero-shot)	BM25	16.1	9.8	6.8	7.1	6.0
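To make the size of these gains concrete, the short calculation below computes the relative perplexity reduction from plain BM25 retrieval on WikiText-103, using only the no-retrieval and BM25 values copied from the table above. It also checks whether a retrieval-augmented model beats the next-larger GPT-2 model without retrieval. The variable and function names are illustrative.

```python
# (no retrieval, BM25) word-level perplexities on WikiText-103, from the table.
wikitext_ppl = {
    "GPT-2 S": (37.5, 29.6),
    "GPT-2 M": (26.3, 21.5),
    "GPT-2 L": (22.0, 18.1),
    "GPT-2 XL": (20.0, 16.6),
}


def relative_reduction(baseline: float, augmented: float) -> float:
    """Percentage by which perplexity drops relative to the baseline."""
    return 100.0 * (baseline - augmented) / baseline


reductions = {
    model: round(relative_reduction(base, aug), 1)
    for model, (base, aug) in wikitext_ppl.items()
}

# Does a model with BM25 beat the next-larger model without retrieval?
order = ["GPT-2 S", "GPT-2 M", "GPT-2 L", "GPT-2 XL"]
beats_next_size = {
    small: wikitext_ppl[small][1] < wikitext_ppl[large][0]
    for small, large in zip(order, order[1:])
}

print(reductions)       # S: 21.1, M: 18.3, L: 17.7, XL: 17.0 (percent)
print(beats_next_size)  # S: False, M: True, L: True
```

On these numbers, BM25 alone cuts WikiText-103 perplexity by roughly 17 to 21 percent, and GPT-2 M and GPT-2 L with retrieval each outperform the next model size without it.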

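BM25 often outperforms dense (neural) retrievers for in-context LM grounding.
More frequent retrieval (a smaller stride s) improves perplexity more than infrequent retrieval; s = 4 is suggested as a practical default.
A retrieval query length of ℓ ≈ 32 tokens is the sweet spot for BM25 in this setting.
In-Context RALM with off-the-shelf retrievers can match the performance of models 2–3× larger across corpora.
Both zero-shot LM-oriented reranking and predictive reranking reduce perplexity beyond vanilla BM25, with predictive reranking trained on domain data giving the clearest benefit.
For larger model families, In-Context RALM can lift a much smaller model close to a far larger one (e.g., a 6.7B OPT model with BM25 approaches the 66B OPT model in certain settings).
In open-domain QA, providing retrieved documents as context substantially improves performance even with the LM kept frozen; two documents are generally sufficient.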
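Zero-shot reranking can be illustrated with a small sketch. Since the upcoming tokens are unknown at retrieval time, one plausible scoring rule, assumed here rather than taken from this review, is to pick the top-k candidate under which a frozen LM assigns the highest probability to the most recent tokens of the prefix. The names `zero_shot_rerank`, `toy_log_prob`, and `score_len` are hypothetical, and the smoothed unigram scorer is only a stand-in for a real LM.

```python
import math
from typing import Callable, Sequence


def zero_shot_rerank(
    candidates: Sequence[Sequence[str]],
    prefix: Sequence[str],
    log_prob: Callable[[Sequence[str], Sequence[str]], float],
    score_len: int = 16,
) -> Sequence[str]:
    """Return the candidate that best explains the last `score_len` prefix tokens.

    `log_prob(context, target)` is assumed to return the LM log-probability
    of `target` given `context`. `score_len` must be positive.
    """
    held_out = list(prefix[-score_len:])  # tokens used as a proxy for the future
    earlier = list(prefix[:-score_len])   # the rest of the prefix stays as context
    return max(candidates, key=lambda doc: log_prob(list(doc) + earlier, held_out))


def toy_log_prob(context: Sequence[str], target: Sequence[str]) -> float:
    """Stand-in for a frozen LM: add-one-smoothed unigram model fit on `context`."""
    counts = {}
    for tok in context:
        counts[tok] = counts.get(tok, 0) + 1
    vocab = set(context) | set(target)
    total = len(context) + len(vocab)
    return sum(math.log((counts.get(tok, 0) + 1) / total) for tok in target)


candidates = [
    "paris is the capital of france".split(),
    "bm25 ranks documents by term overlap".split(),
]
prefix = "sparse retrievers such as bm25 score documents by term".split()
best = zero_shot_rerank(candidates, prefix, toy_log_prob, score_len=4)
print(" ".join(best))
```

With a real LM, `log_prob` would be one forward pass per candidate, so the cost of reranking grows linearly with k.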

This review was generated by AI and reviewed by a human editor.