QUICK REVIEW

[Paper Review] DeBERTa: Decoding-enhanced BERT with Disentangled Attention

Pengcheng He, Xiaodong Liu | arXiv (Cornell University) | 2020-06-05

Topic Modeling | References: 61 | Citations: 422

One-line summary

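DeBERTa introduces disentangled attention, which keeps content and position vectors separate, together with an enhanced mask decoder and scale-invariant fine-tuning (SiFT). It outperforms prior pre-trained language models (PLMs) on NLU and NLG tasks, and its 1.5B-parameter model surpasses human performance on SuperGLUE.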

ABSTRACT

Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8).

Motivation and goals

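Improve pre-training efficiency and downstream NLP performance relative to the BERT and RoBERTa baselines.
Introduce a disentangled attention mechanism that keeps content and position information separate.
Incorporate absolute position information at decoding time (the Enhanced Mask Decoder) to help the masked language modeling (MLM) task.
Apply virtual adversarial training (SiFT) during fine-tuning to improve robustness and generalization.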

Proposed method

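Each token is represented by two vectors, one for its content and one for its position.
The attention score between two tokens decomposes into four terms: content-to-content, content-to-position, position-to-content, and position-to-position. The position-to-position term is dropped because it adds little information when relative position embeddings are used, so three terms are actually computed.
For efficiency, relative position embeddings cover a fixed range of 2k relative distances, where k is the maximum relative distance.
During MLM pre-training, absolute position information is incorporated through the Enhanced Mask Decoder (EMD), after the Transformer layers.
Scale-invariant Fine-Tuning (SiFT) is introduced for robust downstream fine-tuning: perturbations are applied to normalized word embeddings rather than to the raw embeddings.
The large-scale (1.5B) and base DeBERTa models are pre-trained on text datasets of roughly 78–160GB and evaluated on GLUE, SuperGLUE, and NLG tasks.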
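
The disentangled attention described in the first three items above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' implementation: the function names (`relative_index`, `disentangled_attention`), the weight dictionary `W` and its keys, and the random toy inputs are all assumptions made for the example. It clips the relative distance i - j into the 2k buckets and sums the content-to-content, content-to-position, and position-to-content terms, using the 1/sqrt(3d) scaling from the paper.

```python
import numpy as np


def relative_index(n, k):
    """delta[i, j] in [0, 2k): relative distance i - j, shifted by k and clipped."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.clip(i - j + k, 0, 2 * k - 1)


def softmax(x):
    """Row-wise softmax with the usual max-subtraction for stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def disentangled_attention(H, P, W, k):
    """Single-head sketch (hypothetical helper, not the reference code).

    H: (n, d) content states; P: (2k, d) relative position embeddings;
    W: dict of (d, d) projections with assumed keys qc, kc, vc, qr, kr.
    """
    n, d = H.shape
    Qc, Kc, Vc = H @ W["qc"], H @ W["kc"], H @ W["vc"]  # content projections
    Qr, Kr = P @ W["qr"], P @ W["kr"]                   # position projections
    delta = relative_index(n, k)

    # content-to-content: Qc_i . Kc_j
    c2c = Qc @ Kc.T
    # content-to-position: Qc_i . Kr_{delta(i, j)}
    c2p = np.take_along_axis(Qc @ Kr.T, delta, axis=1)
    # position-to-content: Kc_j . Qr_{delta(j, i)} (note the transpose)
    p2c = np.take_along_axis(Kc @ Qr.T, delta, axis=1).T

    # Three terms are summed, hence the 1/sqrt(3d) scaling.
    weights = softmax((c2c + c2p + p2c) / np.sqrt(3 * d))
    return weights @ Vc, weights


# Toy usage with random inputs (sizes chosen only for illustration).
rng = np.random.default_rng(0)
n, d, k = 5, 8, 2
H = rng.normal(size=(n, d))
P = rng.normal(size=(2 * k, d))
W = {name: rng.normal(size=(d, d)) / np.sqrt(d)
     for name in ("qc", "kc", "vc", "qr", "kr")}
out, weights = disentangled_attention(H, P, W, k)
print(out.shape)  # (5, 8)
```

Because the position embeddings are indexed only by the clipped relative distance, the table `P` has 2k rows regardless of sequence length, which is the efficiency point made in the list above.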

Experimental results

Research questions

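RQ1: Does disentangled attention improve performance on NLP tasks over standard self-attention?
RQ2: How does incorporating absolute position information through the Enhanced Mask Decoder affect MLM pre-training?
RQ3: Can SiFT improve the fine-tuning robustness and generalization of large DeBERTa models?
RQ4: How does DeBERTa's performance scale with model size compared with RoBERTa, XLNet, ELECTRA, and similar models?
RQ5: Can DeBERTa surpass human baselines on challenging benchmarks such as SuperGLUE?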

Key results

Model	CoLA (MCC)	QQP (Acc)	MNLI-m/mm (Acc)	SST-2 (Acc)	STS-B (Corr)	QNLI (Acc)	RTE (Acc)	MRPC (Acc)	Avg.
BERT large	60.6	91.3	86.6/-	93.2	90.0	92.3	70.4	88.0	84.05
RoBERTa large	68.0	92.2	90.2/90.2	96.4	92.4	93.9	86.6	90.9	88.82
XLNet large	69.0	92.3	90.8/90.8	97.0	92.5	94.9	85.9	90.8	89.15
ELECTRA large	69.1	92.4	90.9/-	96.9	92.6	95.0	88.0	90.8	89.46
DeBERTa large	70.5	92.3	91.1/91.1	96.8	92.8	95.3	88.3	91.9	90.00

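DeBERTa large scores higher than RoBERTa large and XLNet large on most GLUE tasks and achieves a higher average, even though, per the abstract, it is trained on half of the training data used for RoBERTa-Large.
DeBERTa large improves over RoBERTa-Large on MNLI (+0.9%), SQuAD v2.0 (+2.3%), and RACE (+3.6%).
The 1.5B-parameter DeBERTa model reaches a SuperGLUE macro-average of 89.9, surpassing the human baseline (89.8).
The DeBERTa base model (12 layers, hidden size 768) also consistently outperforms RoBERTa and XLNet on MNLI, SQuAD, and RACE, including in the ablation experiments.
Ablations show that removing the EMD, or either the content-to-position (C2P) or position-to-content (P2C) term, degrades performance across benchmarks, confirming that each component contributes.
Scaling to 1.5B parameters improves performance on both NLU and NLG tasks and offers a more energy-efficient alternative to much larger models such as T5 (11B parameters).
On SuperGLUE, DeBERTa 1.5B combined with SiFT achieves competitive scores, and the ensemble model ranks first on the leaderboard as of January 6, 2021 (90.3 versus the human baseline of 89.8).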

This review was generated by AI and reviewed by a human editor.