QUICK REVIEW

[Paper Review] Generalization through Memorization: Nearest Neighbor Language Models

Urvashi Khandelwal, Omer Levy | arXiv (Cornell University) | 2019-11-01

Topic Modeling | References: 28 | Citations: 56

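One-Line Summary

kNN-LMs augment a pre-trained language model with a k-nearest-neighbor datastore and interpolate the two predictions, achieving state-of-the-art perplexity without additional training and enabling domain adaptation and data-efficient scaling.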

ABSTRACT

We introduce $k$NN-LMs, which extend a pre-trained neural language model (LM) by linearly interpolating it with a $k$-nearest neighbors ($k$NN) model. The nearest neighbors are computed according to distance in the pre-trained LM embedding space, and can be drawn from any text collection, including the original LM training data. Applying this augmentation to a strong Wikitext-103 LM, with neighbors drawn from the original training set, our $k$NN-LM achieves a new state-of-the-art perplexity of 15.79 - a 2.9 point improvement with no additional training. We also show that this approach has implications for efficiently scaling up to larger training sets and allows for effective domain adaptation, by simply varying the nearest neighbor datastore, again without further training. Qualitatively, the model is particularly helpful in predicting rare patterns, such as factual knowledge. Together, these results strongly suggest that learning similarity between sequences of text is easier than predicting the next word, and that nearest neighbor search is an effective approach for language modeling in the long tail.

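Motivation and Objectives

Proposes the hypothesis that learning similarity between text contexts may be easier than predicting the next word.
Proposes augmenting a pre-trained LM with k-nearest-neighbor retrieval, without retraining, to improve next-token prediction.
Empirically evaluates whether explicit memorization of training contexts improves perplexity and enables domain adaptation and data-efficient scaling.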

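Proposed Method

Build a datastore of context representations and next-word targets from the trained LM.
Query the datastore with the test context and retrieve the k nearest neighbors by L2 distance in the embedding space.
Compute a next-word distribution p_kNN from the retrieved neighbors and interpolate it with the base LM distribution using a tunable lambda.
Use FAISS with vectors quantized to 64 bytes for scalable nearest-neighbor search over high-dimensional keys.
Tune the interpolation parameter lambda on validation data.
Evaluate on WikiText-103 and Books test data while varying the size and domain of the datastore.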
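
The retrieval and interpolation steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: brute-force search stands in for FAISS, the function names `knn_distribution` and `interpolate` are hypothetical, and the neighbor weights are assumed to be a softmax over negative squared L2 distances. The toy keys, values, and LM distribution are made up for the example.

```python
import numpy as np

def knn_distribution(query, keys, values, vocab_size, k=2):
    """Next-word distribution p_kNN built from the k nearest stored contexts.

    keys:   (N, d) array of stored context representations
    values: (N,) integer array with the next-word id for each key
    """
    # Squared L2 distance from the query to every key (brute force; assumption).
    dists = np.sum((keys - query) ** 2, axis=1)
    k = min(k, len(keys))
    nearest = np.argpartition(dists, k - 1)[:k]
    # Softmax over negative distances, shifted by the max for numerical stability.
    logits = -dists[nearest]
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Neighbors that share a target word pool their probability mass.
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nearest], weights)
    return p_knn

def interpolate(p_lm, p_knn, lam):
    """Linear interpolation: lam * p_kNN + (1 - lam) * p_LM."""
    return lam * p_knn + (1.0 - lam) * p_lm

# Toy datastore: three stored contexts in a 2-d embedding space, vocabulary of 4.
keys = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
values = np.array([2, 2, 0])
query = np.array([0.0, 0.1])

p_knn = knn_distribution(query, keys, values, vocab_size=4, k=2)
p_lm = np.full(4, 0.25)  # stand-in for the base LM's softmax output
mixed = interpolate(p_lm, p_knn, lam=0.25)
print(mixed)
```

In this toy case both retrieved neighbors point to word 2, so the interpolated distribution shifts mass toward that word while remaining a valid probability distribution.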

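Experimental Results

Research Questions

RQ1: Can the context representations of a pre-trained LM be exploited through kNN retrieval to improve next-token prediction without additional training?
RQ2: How do datastore size and interpolation weight affect perplexity and domain adaptation performance?
RQ3: Can larger or out-of-domain data be used, via the datastore, to effectively augment a smaller LM?
RQ4: Does explicit memorization of training instances help more with long-tail patterns such as factual knowledge or proper nouns?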

Key Results

Model	Dev perplexity (↓)	Test perplexity (↓)	# Trainable parameters
Baevski & Auli (2019)	17.96	18.65	247M
+ kNN-LM	16.06	16.12	247M
+ Continuous Cache	15.81	15.79	247M

The kNN-LM reaches a new state-of-the-art test perplexity of 15.79 on Wikitext-103 with no extra training, a 2.86-point improvement over the base model (18.65).
Using the training data as the datastore alone lowers test perplexity from 18.65 to 16.12; combining kNN with a continuous cache brings it to 15.79.
A model trained on 100M tokens and augmented with a 3B-token datastore can outperform the same model trained on 3B tokens, showing data-efficient scaling.
Domain adaptation is effective: adding an in-domain Books datastore to a Wiki-3B model reduces Books perplexity from 34.84 to 20.47, narrowing the gap to in-domain training.
Retrieving from a larger datastore improves perplexity monotonically, and the optimal lambda increases with datastore size for domain adaptation.
Qualitative analysis shows that the kNN-LM handles long-tail patterns and factual knowledge better through explicit memory than a model relying solely on implicit parameters.
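
Because the best interpolation weight depends on the datastore, lambda has to be chosen on held-out data. A minimal sketch of such a grid search follows, assuming the probabilities that the base LM and the kNN component assign to each gold validation token have already been computed. The names `perplexity` and `tune_lambda`, the grid, and the toy probabilities are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity from the probabilities assigned to the gold tokens."""
    return float(np.exp(-np.mean(np.log(token_probs))))

def tune_lambda(p_lm_gold, p_knn_gold, grid=None):
    """Pick the lambda on the grid that minimizes validation perplexity.

    p_lm_gold, p_knn_gold: arrays holding the probability each model gives
    to the correct next token at every validation position.
    """
    if grid is None:
        # Stop below 1.0 so a zero kNN probability never yields log(0),
        # as long as the base LM probability is positive.
        grid = np.linspace(0.0, 0.9, 10)
    best_lam, best_ppl = None, float("inf")
    for lam in grid:
        mixed = lam * p_knn_gold + (1.0 - lam) * p_lm_gold
        ppl = perplexity(mixed)
        if ppl < best_ppl:
            best_lam, best_ppl = float(lam), ppl
    return best_lam, best_ppl

# Toy validation set of three tokens (made-up probabilities).
p_lm_gold = np.array([0.1, 0.2, 0.5])
p_knn_gold = np.array([0.9, 0.8, 0.0])
best_lam, best_ppl = tune_lambda(p_lm_gold, p_knn_gold)
print(best_lam, best_ppl)
```

Since the grid includes lambda = 0, the selected perplexity can never be worse than that of the base LM alone on the same validation data.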


This review was generated by AI and reviewed by a human editor.