QUICK REVIEW

[Paper Review] TinyBERT: Distilling BERT for Natural Language Understanding

Xiaoqi Jiao, Yichun Yin | arXiv (Cornell University) | 2019. 09. 23.

Topic Modeling | References: 53 | Citations: 136

One-Line Summary

TinyBERT compresses BERT into a smaller, faster model using a novel Transformer distillation method and a two-stage learning framework, and achieves competitive performance on GLUE.

ABSTRACT

Language model pre-training, such as BERT, has significantly improved the performances of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of the Transformer-based models. By leveraging this new KD method, the plenty of knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pretraining and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT with 4 layers is empirically effective and achieves more than 96.8% the performance of its teacher BERT BASE on GLUE benchmark, while being 7.5x smaller and 9.4x faster on inference. TinyBERT with 4 layers is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only about 28% parameters and about 31% inference time of them. Moreover, TinyBERT with 6 layers performs on-par with its teacher BERT BASE.

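Motivation and Objectives

Aims to reduce the computational overhead of pre-trained language models on edge devices while maintaining accuracy.
Introduces a Transformer-specific knowledge distillation method to transfer knowledge from a teacher BERT to a smaller student model.
Proposes a two-stage learning framework (general distillation and task-specific distillation) to capture both general-domain and task-specific knowledge.
Demonstrates that TinyBERT achieves substantial speedups and parameter reductions while maintaining competitive performance on GLUE.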

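Proposed Method

Proposes a Transformer distillation loss that comprises embedding-layer distillation, attention-based distillation, and hidden-state distillation, plus prediction-layer distillation.
Uses a layer mapping function g(m) to align each student layer m with a teacher layer for distillation.
Trains in two stages: general distillation on a large general-domain corpus uses BERT without fine-tuning as the teacher, followed by task-specific distillation with data augmentation, which uses a fine-tuned BERT as the teacher.
For task-specific distillation, expands the training data through data augmentation that combines BERT predictions with GloVe similarity.
Evaluates TinyBERT (4 and 6 layers) on the GLUE benchmark against prior KD baselines and the BERT BASE teacher.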
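The loss terms and layer mapping listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`uniform_layer_map`, `mse`, `matmul`, `soft_cross_entropy`, `layer_loss`), the evenly spaced form of g(m), the projection matrix `w_h` that lifts the student's hidden size to the teacher's, and the unweighted sum of the two per-layer terms are all choices made for this example.

```python
import math

def uniform_layer_map(m, num_student_layers, num_teacher_layers):
    """Assumed g(m): map student layer m (1-based) to an evenly spaced teacher layer."""
    return m * (num_teacher_layers // num_student_layers)

def mse(a, b):
    """Mean squared error between two equally shaped matrices (lists of rows)."""
    total = sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    count = sum(len(row) for row in a)
    return total / count

def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Prediction-layer term: cross-entropy of the student against the teacher's soft labels."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(teacher_probs, student_probs))

def layer_loss(student_attn, teacher_attn, student_hidden, teacher_hidden, w_h):
    """Attention term plus hidden-state term for one mapped (student, teacher) layer pair."""
    attn_term = mse(student_attn, teacher_attn)
    # Project student hidden states into the teacher's hidden size before comparing.
    hidden_term = mse(matmul(student_hidden, w_h), teacher_hidden)
    return attn_term + hidden_term

# Toy usage: a 4-layer student aligned with a 12-layer teacher.
mapping = {m: uniform_layer_map(m, 4, 12) for m in range(1, 5)}
# mapping == {1: 3, 2: 6, 3: 9, 4: 12}
```

Embedding-layer distillation would follow the same pattern as the hidden-state term, with its own projection matrix. A real implementation would operate on framework tensors and per-head attention matrices; nested lists are used here only to keep the sketch dependency-free.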

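Experimental Results

Research Questions

RQ1: Can Transformer-specific knowledge distillation effectively transfer knowledge from BERT to a smaller student?
RQ2: Does the two-stage distillation framework (pre-training distillation and task-specific distillation) improve TinyBERT's performance over a single-stage approach?
RQ3: How do embedding-level, attention-level, and hidden-state-level distillation each contribute to final performance?
RQ4: What are the trade-offs among parameters, FLOPs, and inference speed when compressing BERT into TinyBERT?
RQ5: How closely can a 4-layer or 6-layer TinyBERT approach BERT BASE on GLUE tasks?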

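Key Results

TinyBERT 4 achieves more than 96.8% of BERT BASE's performance on GLUE while being about 7.5x smaller and about 9.4x faster at inference.
TinyBERT 6 performs on par with BERT BASE on GLUE.
TinyBERT 4 outperforms the 4-layer KD baselines (BERT-PKD, DistilBERT 4) by at least 4.4% on average.
TinyBERT 4 achieves strong results while using only about 13.3% of BERT BASE's parameters and about 10.6% of its inference time.
Two-stage learning, which combines general distillation with task-specific distillation using data augmentation, is key to the performance gains.
Attention-based distillation yields a substantial gain, and combining it with hidden-state distillation is complementary.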
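The 13.3% and 10.6% figures above are the reciprocals of the 7.5x size reduction and the 9.4x speedup. A short sketch of that arithmetic follows; the helper name `fraction_of_teacher` is hypothetical and introduced only for this check.

```python
def fraction_of_teacher(reduction_factor):
    """Convert an 'N times smaller/faster' factor into a percentage of the teacher's cost."""
    return 100.0 / reduction_factor

param_share = fraction_of_teacher(7.5)  # share of BERT BASE parameters
time_share = fraction_of_teacher(9.4)   # share of BERT BASE inference time

print(f"{param_share:.1f}% of parameters, {time_share:.1f}% of inference time")
# prints "13.3% of parameters, 10.6% of inference time"
```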


This review was generated by AI and reviewed by a human editor.