QUICK REVIEW

[Paper Review] Autoregressive Search Engines: Generating Substrings as Document Identifiers

Michele Bevilacqua, Giuseppe Ottaviano|arXiv (Cornell University)|2022. 04. 22.

Topic Modeling | Citations: 66

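One-line summary

SEAL combines an autoregressive language model with a compressed full-text substring index (the FM-index) to generate and score n-grams as document identifiers, enabling efficient retrieval and achieving strong downstream results on knowledge-intensive tasks.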

ABSTRACT

Knowledge-intensive language tasks require NLP systems to both provide the correct answer and retrieve supporting evidence for it in a given corpus. Autoregressive language models are emerging as the de-facto standard for generating answers, with newer and more powerful systems emerging at an astonishing pace. In this paper we argue that all this (and future) progress can be directly applied to the retrieval problem with minimal intervention to the models' architecture. Previous work has explored ways to partition the search space into hierarchical structures and retrieve documents by autoregressively generating their unique identifier. In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers. This setup allows us to use an autoregressive model to generate and score distinctive ngrams, that are then mapped to full passages through an efficient data structure. Empirically, we show this not only outperforms prior autoregressive approaches but also leads to an average improvement of at least 10 points over more established retrieval solutions for passage-level retrieval on the KILT benchmark, establishing new state-of-the-art downstream performance on some datasets, while using a considerably lighter memory footprint than competing systems. Code and pre-trained models at https://github.com/facebookresearch/SEAL.

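Motivation and objectives

Motivate the use of autoregressive models to improve knowledge-intensive retrieval
Propose a retrieval method that uses every n-gram in a document as an identifier and imposes no structural constraints on the search space
Integrate an autoregressive LM with an FM-index to constrain generation and retrieve documents
Develop a new scoring mechanism that combines LM probability with n-gram corpus frequency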

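Proposed method

Use BART as the autoregressive model to generate constrained, fixed-length n-grams that occur in the documents
Use the FM-index to constrain decoding and to identify documents containing a generated n-gram n in O(|n| log |V|) time, where |V| is the vocabulary size
Score documents with an LM+FM score that combines the LM probability P(n|q) with the corpus-frequency probability P(n)
Introduce intersective scoring, which sums over multiple n-grams per document, together with coverage-aware weighting
Train SEAL with supervised and unsupervised signals on the KILT datasets so that it learns robust n-gram generation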
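
As a rough illustration of how a substring index can constrain decoding and then map the generated n-gram back to passages, here is a minimal Python sketch. It is not SEAL's implementation: `ToySubstringIndex` is a hypothetical, naive dictionary-based stand-in for a real FM-index, `overlap_scorer` is a hypothetical placeholder for BART's next-token scores, and greedy decoding replaces the beam search a real system would likely use. All names and the toy passages are assumptions made for this example.

```python
from collections import defaultdict


class ToySubstringIndex:
    """Naive stand-in for an FM-index (assumption: small corpus, whitespace tokens).

    Stores every token n-gram up to max_len, the passages that contain it,
    and the tokens that can follow it somewhere in the corpus.
    """

    def __init__(self, passages, max_len=4):
        self.max_len = max_len
        self.docs = defaultdict(set)           # n-gram -> ids of passages containing it
        self.continuations = defaultdict(set)  # n-gram -> tokens that may follow it
        for doc_id, text in passages.items():
            tokens = text.lower().split()
            for start in range(len(tokens)):
                # Any corpus token may start an n-gram.
                self.continuations[()].add(tokens[start])
                last = min(start + max_len, len(tokens))
                for end in range(start + 1, last + 1):
                    ngram = tuple(tokens[start:end])
                    self.docs[ngram].add(doc_id)
                    if end < len(tokens) and end - start < max_len:
                        self.continuations[ngram].add(tokens[end])

    def allowed_next(self, prefix):
        """Tokens that keep prefix + token a substring of some passage."""
        return set(self.continuations.get(tuple(prefix), set()))

    def lookup(self, ngram):
        """Ids of passages that contain the n-gram."""
        return set(self.docs.get(tuple(ngram), set()))


def overlap_scorer(query):
    """Hypothetical placeholder for an LM: prefers tokens that occur in the query."""
    query_tokens = set(query.lower().split())
    return lambda prefix, token: 1.0 if token in query_tokens else 0.0


def constrained_greedy_decode(index, next_token_score, length):
    """Greedy decoding in which each step may only pick an index-approved token."""
    ngram = ()
    for _ in range(min(length, index.max_len)):
        allowed = index.allowed_next(ngram)
        if not allowed:
            break  # the n-gram cannot be extended inside the corpus
        # sorted() makes tie-breaking deterministic (alphabetically first wins).
        best = max(sorted(allowed), key=lambda tok: next_token_score(ngram, tok))
        ngram += (best,)
    return ngram


passages = {
    "d1": "the eiffel tower is in paris",
    "d2": "the colosseum is in rome",
}
index = ToySubstringIndex(passages)
ngram = constrained_greedy_decode(index, overlap_scorer("where is the colosseum"), length=3)
print(ngram, index.lookup(ngram))
```

Because every decoding step is restricted to continuations that exist in the corpus, the generated n-gram is guaranteed to occur in at least one passage, so the final lookup never comes back empty. A real FM-index provides the same two operations over a compressed representation instead of an explicit dictionary of n-grams.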

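Experimental results

Research questions

RQ1: Can autoregressive generation of n-grams provide effective identifiers for retrieval without forcing a predefined index structure?
RQ2: Does combining LM probabilities with FM-index frequencies improve retrieval accuracy and robustness across datasets?
RQ3: Does intersective scoring, which aggregates multiple n-grams per document, rank better than scoring on a single n-gram?
RQ4: How does SEAL perform on passage-level retrieval and downstream QA when paired with a standard reader?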

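Key findings

SEAL matches or outperforms recent autoregressive retrieval methods on established benchmarks.
LM+FM and intersective scoring, when paired with a reader, deliver state-of-the-art downstream performance on several datasets.
SEAL is memory-efficient, with a lighter index footprint than several baselines.
Interpretable n-gram generation is shown to improve generalization to novel questions and answers.
SEAL's intersective scoring aggregates complementary n-gram signals, improving top-k retrieval performance.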
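
To make the LM+FM and intersective scoring ideas more concrete, the following Python sketch shows one plausible way to weight an n-gram by contrasting P(n|q) with P(n) and then to sum those weights per document. The log-odds-ratio form in `lm_fm_score`, the clipping at zero, the function names, and the toy numbers are all assumptions made for illustration; the coverage-aware weighting is left out for brevity, and the paper should be consulted for the exact definitions.

```python
import math
from collections import defaultdict


def lm_fm_score(p_ngram_given_query, p_ngram):
    """Assumed log-odds-ratio weight, clipped at zero.

    Large when the LM considers the n-gram likely for the query but the
    n-gram is rare in the corpus; zero when the n-gram is no more likely
    under the LM than under the corpus frequency.
    """
    eps = 1e-12  # keeps the odds finite at probabilities of 0 or 1
    p_q = min(max(p_ngram_given_query, eps), 1 - eps)
    p_c = min(max(p_ngram, eps), 1 - eps)
    odds_ratio = (p_q * (1 - p_c)) / (p_c * (1 - p_q))
    return max(0.0, math.log(odds_ratio))


def rank_documents(ngram_lm_probs, ngram_occurrences, corpus_tokens):
    """Sum the weights of all generated n-grams that occur in each document.

    ngram_lm_probs:    n-gram -> P(n|q) from the language model
    ngram_occurrences: n-gram -> list of doc ids, one entry per occurrence
    corpus_tokens:     total token count, used to turn counts into P(n)
    """
    scores = defaultdict(float)
    for ngram, p_q in ngram_lm_probs.items():
        occurrences = ngram_occurrences.get(ngram, [])
        if not occurrences:
            continue  # the n-gram is not in the corpus, so it supports no document
        weight = lm_fm_score(p_q, len(occurrences) / corpus_tokens)
        for doc_id in set(occurrences):  # count each n-gram once per document
            scores[doc_id] += weight
    # Highest score first; doc id breaks ties deterministically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy inputs (fabricated for the example, not taken from the paper).
lm_probs = {"colosseum": 0.2, "is in": 0.3, "the": 0.01}
occurrences = {
    "colosseum": ["d2"],
    "is in": ["d1", "d2"],
    "the": ["d1", "d2"] * 25,  # a frequent n-gram: 50 occurrences
}
for doc_id, score in rank_documents(lm_probs, occurrences, corpus_tokens=1000):
    print(doc_id, round(score, 3))
```

In this toy setup the frequent n-gram "the" contributes nothing because its corpus probability exceeds its LM probability, while d2 outranks d1 because it is supported by two distinctive n-grams instead of one. That is the intuition behind aggregating complementary n-gram signals per document.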


This review was generated by AI and reviewed by a human editor.