QUICK REVIEW

[Paper Review] MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Wenhui Wang, Furu Wei | arXiv (Cornell University) | 2020-02-25

Topic Modeling | References: 57 | Citations: 632

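One-Line Summary

This paper presents MiniLM, a task-agnostic deep self-attention distillation method for compressing large Transformer LMs. The student mimics only the self-attention of the teacher's last layer, with value-relations added as further knowledge. This permits flexible student architectures and achieves strong performance with far fewer parameters.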

ABSTRACT

Pre-trained language models (e.g., BERT (Devlin et al., 2018) and its variants) have achieved remarkable success in varieties of NLP tasks. However, these models usually consist of hundreds of millions of parameters which brings challenges for fine-tuning and online serving in real-life applications due to latency and capacity constraints. In this work, we present a simple and effective approach to compress large Transformer (Vaswani et al., 2017) based pre-trained models, termed as deep self-attention distillation. The small model (student) is trained by deeply mimicking the self-attention module, which plays a vital role in Transformer networks, of the large model (teacher). Specifically, we propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student. Furthermore, we introduce the scaled dot-product between values in the self-attention module as the new deep self-attention knowledge, in addition to the attention distributions (i.e., the scaled dot-product of queries and keys) that have been used in existing works. Moreover, we show that introducing a teacher assistant (Mirzadeh et al., 2019) also helps the distillation of large pre-trained Transformer models. Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines in different parameter size of student models. In particular, it retains more than 99% accuracy on SQuAD 2.0 and several GLUE benchmark tasks using 50% of the Transformer parameters and computations of the teacher model. We also obtain competitive results in applying deep self-attention distillation to multilingual pre-trained models.

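Motivation and Objectives

The work is motivated by the need to compress large pre-trained Transformer LMs (e.g., BERT) for faster fine-tuning and serving.
It proposes a task-agnostic distillation framework in which the student deeply mimics the self-attention of the teacher's last layer.
It introduces self-attention value-relations as additional knowledge that can be transferred without extra parameters.
It shows that a smaller student (e.g., 6-layer, 768-d) can reach performance close to the teacher's along with a substantial speedup.
It shows that introducing a teacher assistant (an intermediate-size student) can further improve performance, especially for very small students.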

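Proposed Method

The student is trained to deeply mimic the self-attention module of the teacher's last Transformer layer.
Two kinds of knowledge are transferred: the self-attention distributions (scaled dot-product of queries and keys) and the value-relations (scaled dot-product between values).
The attention transfer loss is the KL-divergence between the teacher's and the student's attention distributions.
The value-relation transfer loss is the KL-divergence between the teacher's and the student's value-relation matrices; this transfer requires no additional parameters.
When needed, a teacher assistant (an intermediate-size student) is used to narrow the gap between a large teacher and a small student and to improve performance.
The method is compared with previous task-agnostic distillation methods, and the benefits of last-layer distillation, value-relations, and the teacher assistant (TA) are demonstrated.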
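The two transfer losses described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the authors' implementation. The function names (`relation`, `kl_rows`, `minilm_loss`), the dict-of-lists data layout, the `eps` smoothing term, and the averaging over heads and positions are all assumptions of this sketch. It applies a row-wise softmax to the scaled dot-products so that each row is a probability distribution, then takes the KL-divergence from teacher to student.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def relation(left, right, d_k):
    """Row-wise softmax of left @ right^T / sqrt(d_k) for one attention head.

    left, right: lists of seq_len vectors, each of length d_k.
    Returns a seq_len x seq_len matrix whose rows sum to 1.
    """
    scale = math.sqrt(d_k)
    return [softmax([sum(a * b for a, b in zip(u, v)) / scale for v in right])
            for u in left]

def kl_rows(teacher, student, eps=1e-12):
    """Mean over rows of KL(teacher_row || student_row); eps avoids log(0)."""
    total = 0.0
    for t_row, s_row in zip(teacher, student):
        total += sum(t * math.log((t + eps) / (s + eps))
                     for t, s in zip(t_row, s_row))
    return total / len(teacher)

def minilm_loss(teacher_heads, student_heads):
    """Attention-distribution loss plus value-relation loss (hypothetical helper).

    Each argument is a list of heads; a head is a dict with "q", "k", "v",
    each a list of seq_len vectors. Teacher and student must share the number
    of heads and the sequence length, but not the per-head dimension.
    """
    loss_at = loss_vr = 0.0
    for t, s in zip(teacher_heads, student_heads):
        t_dk, s_dk = len(t["q"][0]), len(s["q"][0])
        # Queries-keys: the attention distributions.
        loss_at += kl_rows(relation(t["q"], t["k"], t_dk),
                           relation(s["q"], s["k"], s_dk))
        # Values-values: the value-relations.
        loss_vr += kl_rows(relation(t["v"], t["v"], t_dk),
                           relation(s["v"], s["v"], s_dk))
    n = len(teacher_heads)
    return loss_at / n + loss_vr / n

# Toy usage with made-up numbers: one head, two tokens. The teacher head is
# 2-dimensional and the student head is 3-dimensional.
toy_teacher = [{"q": [[1.0, 0.0], [0.0, 1.0]],
                "k": [[1.0, 0.0], [0.0, 1.0]],
                "v": [[2.0, 0.0], [0.0, 2.0]]}]
toy_student = [{"q": [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]],
                "k": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                "v": [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]}]
print(minilm_loss(toy_teacher, toy_student))
```

Both relation matrices have shape seq_len x seq_len per head, whatever the hidden size. This is why the student's width and depth need not match the teacher's and why no extra projection parameters are needed. Passing the same heads as both teacher and student gives a loss of zero.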

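Experimental Results

Research Questions

RQ1: Can task-agnostic distillation be effective when the student mimics only the self-attention of the teacher's last layer?
RQ2: Does transferring value-relations in addition to attention distributions give deeper mimicry and better student performance?
RQ3: Does introducing a teacher assistant improve distillation, and is it especially effective for small students?
RQ4: Can the method support flexible student architectures that differ in number of layers and hidden size, without layer-to-layer mapping?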

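Key Results

A 6-layer, 768-hidden MiniLM student distilled from BERT-BASE stays close to the teacher on SQuAD 2.0 and GLUE tasks while being much faster; according to the abstract, it retains more than 99% accuracy on SQuAD 2.0 and several GLUE tasks using 50% of the teacher's Transformer parameters and computation.
Transferring both the attention distributions and the value-relations of the teacher's last layer yields measurable gains over transferring attention distributions alone and over other baselines.
Value-relation transfer provides deeper self-attention mimicry without additional parameters and improves results across a range of tasks and student configurations.
A teacher assistant further improves the performance of smaller students, helping to close the gap between teacher and student.
MiniLM also yields competitive multilingual models with substantially fewer Transformer parameters.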
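The "50% of the Transformer parameters" figure for the 6-layer, 768-hidden student can be checked with simple arithmetic. The sketch below assumes the standard BERT-style layer layout (query, key, value, and output projections; a feed-forward block with inner size 4 x hidden; two LayerNorms) and a 12-layer, 768-hidden BERT-BASE teacher. It counts only Transformer-layer parameters and leaves out embeddings. The function names are illustrative.

```python
def transformer_layer_params(hidden, ffn_mult=4):
    """Weights and biases in one BERT-style Transformer layer (assumed layout)."""
    # Four hidden x hidden projections (Q, K, V, output), each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward block: hidden -> inner -> hidden, each with a bias.
    inner = ffn_mult * hidden
    feed_forward = (hidden * inner + inner) + (inner * hidden + hidden)
    # Two LayerNorms, each with a scale and a shift vector.
    layer_norms = 2 * (2 * hidden)
    return attention + feed_forward + layer_norms

def transformer_params(layers, hidden):
    """Total parameters across all Transformer layers, excluding embeddings."""
    return layers * transformer_layer_params(hidden)

teacher = transformer_params(12, 768)  # BERT-BASE-sized teacher
student = transformer_params(6, 768)   # 6-layer, 768-hidden student
print(teacher, student, student / teacher)
# prints: 85054464 42527232 0.5
```

With the hidden size held fixed, the Transformer parameter count scales linearly with depth, so halving the number of layers halves it.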

This review was generated by AI and reviewed by a human editor.