QUICK REVIEW

[Paper Review] MPNet: Masked and Permuted Pre-training for Language Understanding

Kaitao Song, Xu Tan|arXiv (Cornell University)|2020. 04. 20.

Topic Modeling | References: 26 | Citations: 503

ONE-LINE SUMMARY

MPNet unifies MLM (BERT) and PLM (XLNet) by modeling the dependency among predicted tokens and using full-sentence position information, achieving strong gains on GLUE, SQuAD, and other benchmarks.

ABSTRACT

BERT adopts masked language modeling (MLM) for pre-training and is one of the most successful pre-training models. Since BERT neglects dependency among predicted tokens, XLNet introduces permuted language modeling (PLM) for pre-training to address this problem. However, XLNet does not leverage the full position information of a sentence and thus suffers from position discrepancy between pre-training and fine-tuning. In this paper, we propose MPNet, a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations. MPNet leverages the dependency among predicted tokens through permuted language modeling (vs. MLM in BERT), and takes auxiliary position information as input to make the model see a full sentence and thus reducing the position discrepancy (vs. PLM in XLNet). We pre-train MPNet on a large-scale dataset (over 160GB text corpora) and fine-tune on a variety of down-streaming tasks (GLUE, SQuAD, etc). Experimental results show that MPNet outperforms MLM and PLM by a large margin, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods (e.g., BERT, XLNet, RoBERTa) under the same model setting. The code and the pre-trained models are available at: https://github.com/microsoft/MPNet.

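MOTIVATION AND GOALS

Improve pre-training by addressing the limitation of MLM (predicted tokens are treated as independent of one another) and the limitation of PLM (no full-sentence position information).
Develop a pre-training objective that exploits the dependency among predicted tokens while also including full-sentence position information.
Demonstrate that MPNet delivers substantial improvements over BERT, XLNet, RoBERTa, and ELECTRA across a range of NLP benchmarks.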

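PROPOSED METHOD

Introduce the MPNet objective over permuted sequences, maximizing $P(x_{z_t} \mid x_{z_{<t}}, M_{z_{>c}}; \theta)$, where $z$ is a permutation of the positions, $c$ is the number of non-predicted tokens, and $M_{z_{>c}}$ denotes mask tokens placed at the positions of the predicted tokens.
Use two-stream self-attention to model the output dependency among predicted tokens.
Apply position compensation so that both the query stream and the content stream see full-sentence information during pre-training.
Input design: permute the order of the original sequence, concatenate the non-predicted tokens, and mask the tokens in the predicted part.
Pre-train on a large-scale corpus (~160GB) and fine-tune on downstream tasks (GLUE, SQuAD, RACE, IMDB).
Compare MPNet against MLM and PLM under the same model setting, and against strong baselines (BERT, XLNet, RoBERTa).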
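
The input design and position compensation described above can be illustrated with a small sketch. It assumes the layout from the MPNet paper: non-predicted tokens first, then one mask token per predicted token, then the predicted tokens themselves, with each mask token carrying the position of the token it stands in for. The function name `build_mpnet_input`, the `[M]` mask string, and the `seed` parameter are hypothetical names chosen for illustration; this is not the code from the official repository.

```python
import random

MASK = "[M]"  # hypothetical mask symbol used only in this sketch


def build_mpnet_input(tokens, num_predicted, seed=0):
    """Build one permuted MPNet-style input (illustrative sketch only).

    Returns (input_tokens, input_positions, targets):
      input_tokens    = non-predicted tokens + mask tokens + predicted tokens
      input_positions = original position of every entry in input_tokens
      targets         = (original position, token) pairs to be predicted
    """
    n = len(tokens)
    if not 0 < num_predicted <= n:
        raise ValueError("num_predicted must be between 1 and len(tokens)")

    # Sample a permutation z of the positions 0..n-1.
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)

    # The first c positions of z are non-predicted; the rest are predicted.
    c = n - num_predicted
    kept, predicted = order[:c], order[c:]

    # Position compensation: the mask tokens reuse the positions of the
    # predicted tokens, so the first n entries cover every sentence position.
    input_tokens = (
        [tokens[i] for i in kept]
        + [MASK] * len(predicted)
        + [tokens[i] for i in predicted]
    )
    input_positions = kept + predicted + predicted
    targets = [(i, tokens[i]) for i in predicted]
    return input_tokens, input_positions, targets


if __name__ == "__main__":
    sentence = ["the", "task", "is", "sentence", "classification", "."]
    toks, pos, tgt = build_mpnet_input(sentence, num_predicted=3)
    print(toks)
    print(pos)
    print(tgt)
```

In this sketch the non-predicted tokens and the mask tokens together span all positions of the sentence, which is the sense in which the model "sees a full sentence" as in fine-tuning. The attention masks that restrict which predicted tokens each step may attend to (the two-stream part) are deliberately left out.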

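EXPERIMENTAL RESULTS

Research Questions

RQ1: Can exploiting the dependency among predicted tokens (output dependency) during pre-training improve representation learning beyond MLM?
RQ2: Does including full-sentence position information reduce the discrepancy between pre-training and fine-tuning relative to PLM?
RQ3: How does MPNet perform on standard benchmarks (GLUE, SQuAD, RACE, IMDB) compared with previous pre-training methods?
RQ4: What is the empirical impact of position compensation and the permutation mechanism in MPNet?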

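Key Results

Under the same base model setting, MPNet outperforms MLM and PLM by a large margin on the GLUE development set.
In the reported experiments, MPNet achieves better results than BERT, XLNet, and RoBERTa on the GLUE benchmark.
On SQuAD v1.1 and v2.0, MPNet surpasses BERT, XLNet, and RoBERTa on the reported metrics.
Pre-trained on 16GB of data, MPNet performs strongly on RACE and IMDB, with larger gains when pre-trained on 160GB.
Ablation experiments confirm that position compensation and output dependency are both important to MPNet's performance.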

This review was generated by AI and reviewed by a human editor.