QUICK REVIEW

[Paper Review] How Context Affects Language Models' Factual Predictions

Fabio Petroni, Patrick Lewis | arXiv (Cornell University) | 2020-05-10

Topic Modeling | References: 42 | Citations: 80

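One-line summary

The paper shows that supplying retrieved contexts at test time substantially improves the factual cloze-style QA of pre-trained language models (BERT/RoBERTa) without any supervision, making them competitive with supervised baselines, and that BERT's Next Sentence Prediction (NSP) helps filter out noisy contexts.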

ABSTRACT

When pre-trained on large unsupervised textual corpora, language models are able to store and retrieve factual knowledge to some extent, making it possible to use them directly for zero-shot cloze-style question answering. However, storing factual knowledge in a fixed number of weights of a language model clearly has limitations. Previous approaches have successfully provided access to information outside the model weights using supervised architectures that combine an information retrieval system with a machine reading component. In this paper, we go a step further and integrate information from a retrieval system with a pre-trained language model in a purely unsupervised way. We report that augmenting pre-trained language models in this way dramatically improves performance and that the resulting system, despite being unsupervised, is competitive with a supervised machine reading baseline. Furthermore, processing query and context with different segment tokens allows BERT to utilize its Next Sentence Prediction pre-trained classifier to determine whether the context is relevant or not, substantially improving BERT's zero-shot cloze-style question-answering performance and making its predictions robust to noisy contexts.

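Motivation and objectives

Show that retrieving context at test time can unlock factual knowledge in pre-trained language models without supervision.
Quantify how the context type (oracle, retrieved, generated, adversarial) affects LAMA-based cloze QA performance.
Evaluate how robust BERT and RoBERTa are to noisy contexts, and what role NSP plays in judging context relevance.
Compare the unsupervised retrieval-augmented LM against a supervised open-domain QA baseline (DrQA).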

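Proposed method

Evaluate BERT-large and RoBERTa-large on cloze-style questions from the LAMA relation probe.
Augment each cloze prompt with one of four context types: oracle (excerpts from Wikipedia), retrieved (TF-IDF paragraphs, as in DrQA), generated (context produced by an autoregressive LM), and adversarial (context unrelated to the question).
Separate the question from the context with a model-specific mechanism: distinct segment tokens for BERT where applicable, otherwise an EOS/separator token.
Measure P@1 on single-token answers for Google-RE, T-REx, and a SQuAD-derived subset.
Analyze how much input segmentation matters for activating the NSP classifier and for the model's ability to use the context.
Compare against DrQA as the supervised open-domain QA baseline and discuss the implications for unsupervised QA.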
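
The P@1 measurement described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: `precision_at_1`, `toy_predictor`, and the example tuples are hypothetical names and toy data introduced here, and the predictor is a trivial stand-in for a real masked language model.

```python
from typing import Callable, Iterable, Optional, Tuple

# One probe item: (cloze query, optional context, gold single-token answer).
Example = Tuple[str, Optional[str], str]


def precision_at_1(
    examples: Iterable[Example],
    predict_top1: Callable[[str, Optional[str]], str],
) -> float:
    """Fraction of examples whose top-ranked prediction equals the gold token."""
    items = list(examples)
    if not items:
        return 0.0
    hits = sum(
        1 for query, context, gold in items
        if predict_top1(query, context) == gold
    )
    return hits / len(items)


def toy_predictor(query: str, context: Optional[str]) -> str:
    """Hypothetical stand-in for a masked LM: returns the last word of the
    context when one is given, otherwise a fixed wrong guess."""
    if context is None:
        return "unknown"
    return context.split()[-1]


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the paper.
    toy_examples = [
        ("Dante was born in [MASK].", "Dante was born in Florence", "Florence"),
        ("Paris is the capital of [MASK].", None, "France"),
    ]
    print(precision_at_1(toy_examples, toy_predictor))
```

In a real setup, `predict_top1` would wrap the model's highest-scoring vocabulary item at the `[MASK]` position, and the same function would be called once per context type so the settings can be compared on identical queries.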

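Experimental results

Research questions

RQ1: Can an unsupervised retrieval-augmented language model reach supervised QA performance on factual knowledge tasks?
RQ2: How does the context type (oracle, retrieved, generated, adversarial) affect LM-based cloze QA accuracy?
RQ3: What role do BERT's NSP objective and input segmentation play in how the model uses context?
RQ4: Are the gains from retrieved context robust across relations and datasets?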

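Key findings

Context-augmented prompts substantially improve the LMs' factual QA: B-ora (oracle context) yields large gains over context-free input, and B-ret (retrieved context) often matches or exceeds the supervised baseline.
With retrieved contexts, BERT is competitive with DrQA on Google-RE and SQuAD and shows large improvements over the context-free baseline on many relations.
With adversarial contexts, BERT remains robust when the query and context are given as two separate segments, which suggests that NSP helps filter out irrelevant context; plain concatenation, by contrast, degrades performance substantially.
Generated contexts help for some relations but are generally less effective than retrieved or oracle contexts, and can mislead the model when they are noisy.
BERT's NSP-based relevance signal appears to enable robust conditioning on context, contributing to accuracy gains without supervised fine-tuning.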
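
The contrast between two-segment input and plain concatenation can be made concrete with a small sketch of how the token and segment-id sequences differ. This is an illustration under stated assumptions rather than the paper's preprocessing: `build_bert_input` is a hypothetical helper, the inputs are assumed to be already tokenized, and the query-then-context order is an assumption of this sketch.

```python
from typing import List, Optional, Tuple


def build_bert_input(
    query_tokens: List[str],
    context_tokens: Optional[List[str]] = None,
    two_segments: bool = True,
) -> Tuple[List[str], List[int]]:
    """Return (tokens, segment_ids) for a BERT-style input.

    two_segments=True : [CLS] query [SEP] context [SEP], context in segment 1
    two_segments=False: [CLS] query context [SEP], everything in segment 0
    """
    tokens = ["[CLS]"] + list(query_tokens)
    segment_ids = [0] * len(tokens)

    if context_tokens is None:
        # Context-free baseline: a single segment closed by [SEP].
        return tokens + ["[SEP]"], segment_ids + [0]

    if two_segments:
        # Close the query segment, then place the context in segment 1.
        tokens.append("[SEP]")
        segment_ids.append(0)
        context_segment = 1
    else:
        # Plain concatenation: no boundary marker, same segment id.
        context_segment = 0

    tokens += list(context_tokens) + ["[SEP]"]
    segment_ids += [context_segment] * (len(context_tokens) + 1)
    return tokens, segment_ids


if __name__ == "__main__":
    query = ["Dante", "was", "born", "in", "[MASK]", "."]
    context = ["Dante", "was", "born", "in", "Florence", "."]
    for flag in (True, False):
        toks, segs = build_bert_input(query, context, two_segments=flag)
        print(flag, list(zip(toks, segs)))
```

Only the two-segment layout resembles the sentence-pair format BERT saw during NSP pre-training, which is one plausible reading of why the review reports that it copes better with irrelevant context.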

This review was generated by AI and reviewed by a human editor.