QUICK REVIEW

[Paper Review] ConvBERT: Improving BERT with Span-based Dynamic Convolution

Zihang Jiang, Weihao Yu | arXiv (Cornell University) | 2020-08-06

Topic Modeling | References: 69 | Citations: 118

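One-line summary

ConvBERT replaces redundant attention heads with span-based dynamic convolution to form a mixed attention block, and adds bottleneck attention and a grouped feed-forward module, achieving stronger GLUE/SQuAD results than BERT at a lower pre-training cost.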

ABSTRACT

Pre-trained language models like BERT and its variants have recently achieved impressive performance in various natural language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers large memory footprint and computation cost. Although all its attention heads query on the whole input sequence for generating the attention map from a global perspective, we observe some heads only need to learn local dependencies, which means the existence of computation redundancy. We therefore propose a novel span-based dynamic convolution to replace these self-attention heads to directly model local dependencies. The novel convolution heads, together with the rest self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning. We equip BERT with this mixed attention design and build a ConvBERT model. Experiments have shown that ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training cost and fewer model parameters. Remarkably, ConvBERTbase model achieves 86.4 GLUE score, 0.7 higher than ELECTRAbase, while using less than 1/4 training cost. Code and pre-trained models will be released.

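Motivation and Objectives

Reduce the redundancy among BERT's self-attention heads by exploiting local dependencies.
Introduce span-based dynamic convolution to capture local context efficiently.
Build ConvBERT from a mixed attention block, bottleneck attention, and a grouped feed-forward module to improve both efficiency and performance.
Evaluate ConvBERT on GLUE and SQuAD to demonstrate accuracy gains at a lower training cost.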

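Proposed Method

Proposes span-based dynamic convolution, which generates the convolution kernel conditioned on Q and a span-aware key K_s computed from a local input span.
Combines self-attention and span-based dynamic convolution into a Mixed-Attn block in which both paths share the same Q but use different keys.
Introduces a bottleneck structure that reduces the dimensionality of the self-attention path and its heads.
Applies grouped linear operators in the feed-forward module to reduce parameters and computation.
Pre-trains ConvBERT in a setup similar to ELECTRA's replaced token detection and evaluates it on GLUE and SQuAD.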
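
To make the first item above more concrete, here is a minimal single-head NumPy sketch of span-based dynamic convolution. It is an illustration under stated assumptions, not the authors' implementation: the function name `span_dynamic_conv`, the weight names, the zero padding, and the use of one depthwise plus one pointwise projection to form K_s are all choices made for this sketch, and the kernel width is assumed to be odd.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def span_dynamic_conv(X, W_q, W_v, dw_kernel, W_pw, W_f):
    """Single-head span-based dynamic convolution (illustrative sketch).

    X:         (n, d_in)  token representations
    W_q, W_v:  (d_in, d)  query / value projections
    dw_kernel: (k, d_in)  depthwise kernel over a span of k tokens (k odd)
    W_pw:      (d_in, d)  pointwise projection applied after the depthwise step
    W_f:       (d, k)     maps Q * K_s to k kernel logits per position
    """
    n = X.shape[0]
    k = dw_kernel.shape[0]
    pad = k // 2

    Q = X @ W_q  # (n, d)
    V = X @ W_v  # (n, d)

    # Span-aware key: each position summarises its local span of k tokens.
    X_pad = np.pad(X, ((pad, pad), (0, 0)))
    x_windows = np.stack([X_pad[i:i + k] for i in range(n)])  # (n, k, d_in)
    K_s = (x_windows * dw_kernel).sum(axis=1) @ W_pw          # (n, d)

    # Per-position kernel generated from the query and the span-aware key.
    kernel = softmax((Q * K_s) @ W_f, axis=-1)                # (n, k)

    # Apply each position's kernel to its local window of values.
    V_pad = np.pad(V, ((pad, pad), (0, 0)))
    v_windows = np.stack([V_pad[i:i + k] for i in range(n)])  # (n, k, d)
    out = np.einsum("nk,nkd->nd", kernel, v_windows)          # (n, d)
    return out, kernel


# Small demo with random weights (sizes are arbitrary).
rng = np.random.default_rng(0)
n, d_in, d, k = 8, 6, 4, 3
demo_out, demo_kernel = span_dynamic_conv(
    rng.normal(size=(n, d_in)),
    rng.normal(size=(d_in, d)),
    rng.normal(size=(d_in, d)),
    rng.normal(size=(k, d_in)),
    rng.normal(size=(d_in, d)),
    rng.normal(size=(d, k)),
)
print(demo_out.shape, demo_kernel.shape)
```

Because every output position only reads a window of k tokens, the cost grows linearly with sequence length, whereas a self-attention head compares every token with every other token. In a Mixed-Attn block, an output like this would sit alongside the outputs of the remaining self-attention heads.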

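Experimental Results

Research Questions

RQ1: Can span-based dynamic convolution capture local dependencies more efficiently than standard self-attention?
RQ2: Does integrating span-based dynamic convolution with self-attention reduce redundancy and improve downstream task performance?
RQ3: At similar or lower training cost, how large is ConvBERT's advantage over BERT and ELECTRA on the GLUE and SQuAD benchmarks?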

Key Results

Model	Train FLOPs	Params	MNLI	QNLI	QQP	RTE	SST-2	MRPC	CoLA	STS-B	Avg.
ConvBERTbase	1.9e19 (15x)	106M	85.3	92.4	89.6	74.7	95.0	88.2	66.0	88.2	84.9
ConvBERTbase (train longer)	7.6e19 (59x)	106M	88.3	93.2	90.0	77.9	95.7	88.3	67.8	89.7	86.4
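
As a quick sanity check on the table, the average column can be recomputed from the eight task scores. The short script below assumes the average is a plain unweighted mean rounded to one decimal place, which is an assumption of this check rather than something the review states. It also computes the ratio between the two training budgets.

```python
# Task scores copied from the table above, in column order:
# MNLI, QNLI, QQP, RTE, SST-2, MRPC, CoLA, STS-B
scores = {
    "ConvBERTbase": [85.3, 92.4, 89.6, 74.7, 95.0, 88.2, 66.0, 88.2],
    "ConvBERTbase (train longer)": [88.3, 93.2, 90.0, 77.9, 95.7, 88.3, 67.8, 89.7],
}
reported_avg = {
    "ConvBERTbase": 84.9,
    "ConvBERTbase (train longer)": 86.4,
}
train_flops = {
    "ConvBERTbase": 1.9e19,
    "ConvBERTbase (train longer)": 7.6e19,
}


def mean_score(values):
    """Unweighted mean of the per-task scores."""
    return sum(values) / len(values)


for name, values in scores.items():
    recomputed = round(mean_score(values), 1)
    print(f"{name}: recomputed={recomputed}, reported={reported_avg[name]}")

# How much more compute the longer run uses than the shorter one.
flops_ratio = train_flops["ConvBERTbase (train longer)"] / train_flops["ConvBERTbase"]
print(f"train-longer / base FLOPs ratio: {flops_ratio:.1f}")
```

Under that assumption both reported averages are consistent with the per-task scores, and the longer run uses about four times the training FLOPs of the shorter one.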

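ConvBERT outperforms similarly sized BERT and ELECTRA baselines on GLUE at a lower pre-training cost.
The base-sized ConvBERT reaches a GLUE score of 86.4, which is 0.7 higher than ELECTRAbase, at less than 1/4 of the training cost.
Span-based dynamic convolution shows a clear advantage over both vanilla dynamic convolution and a conventional convolution used in parallel.
Bottleneck attention and the grouped feed-forward module reduce parameters while maintaining or improving performance.
The ConvBERT small and base models need fewer FLOPs and parameters than the baseline models while maintaining or improving task performance.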
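
The parameter saving from a grouped feed-forward layer can be illustrated with a small NumPy sketch. The helper names `grouped_linear` and `count_params` are hypothetical, biases are omitted, and the 768-to-3072 layer size with two groups is used only as an illustrative BERT-base-like shape, not as the configuration reported in the paper.

```python
import numpy as np


def grouped_linear(x, weights):
    """Grouped linear map: split features into groups, project each separately.

    x:       (n, d_in)
    weights: list of g arrays, each of shape (d_in // g, d_out // g)
    """
    chunks = np.split(x, len(weights), axis=-1)
    return np.concatenate([c @ w for c, w in zip(chunks, weights)], axis=-1)


def count_params(d_in, d_out, groups=1):
    """Number of weights in a (grouped) linear layer, biases excluded."""
    if d_in % groups or d_out % groups:
        raise ValueError("d_in and d_out must be divisible by groups")
    return groups * (d_in // groups) * (d_out // groups)


# Illustrative sizes only: a 768 -> 3072 expansion layer.
dense = count_params(768, 3072, groups=1)
grouped = count_params(768, 3072, groups=2)
print(dense, grouped, dense / grouped)
```

With g groups the weight count drops by a factor of g, because each group only connects its own slice of input features to its own slice of output features. The trade-off is that features in different groups no longer interact inside that layer.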

This review was generated by AI and reviewed by a human editor.