QUICK REVIEW

[Paper Review] TinyBERT: Distilling BERT for Natural Language Understanding

Xiaoqi Jiao, Yichun Yin|arXiv (Cornell University)|Sep 23, 2019

Topic Modeling | References: 53 | Citations: 136

One-line summary

TinyBERT compresses BERT into a smaller, faster model using a novel Transformer distillation method and a two-stage learning framework, and achieves competitive performance on GLUE.

ABSTRACT

Language model pre-training, such as BERT, has significantly improved the performances of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of the Transformer-based models. By leveraging this new KD method, the plenty of knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pretraining and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT with 4 layers is empirically effective and achieves more than 96.8% the performance of its teacher BERT BASE on GLUE benchmark, while being 7.5x smaller and 9.4x faster on inference. TinyBERT with 4 layers is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only about 28% parameters and about 31% inference time of them. Moreover, TinyBERT with 6 layers performs on-par with its teacher BERT BASE.

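Motivation and objectives

Reduce the computational overhead of pre-trained language models on edge devices while maintaining accuracy.
Introduce a Transformer-specific knowledge distillation method that transfers knowledge from the teacher BERT to a smaller student model.
Propose a two-stage learning framework (general distillation and task-specific distillation) that captures both general-domain and task-specific knowledge.
Demonstrate that TinyBERT stays competitive on GLUE while delivering substantial speedups and parameter reductions.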

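Proposed method

Proposes a Transformer distillation loss with three components (embedding-layer distillation, attention-based distillation, and hidden-state distillation), plus prediction-layer distillation.
Uses a layer mapping function g(m) that aligns each student layer with a teacher layer for distillation.
Trains in two stages: first, general distillation on a large general-domain corpus with the pre-trained BERT (before fine-tuning) as the teacher; then, task-specific distillation with data augmentation, with a fine-tuned BERT as the teacher.
For task-specific distillation, augments the training data by replacing words with candidates drawn from BERT predictions and GloVe embedding similarity.
Evaluates TinyBERT (4-layer and 6-layer) on the GLUE benchmark against prior KD baselines and against the teacher, BERT BASE.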
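
The loss components listed above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the authors' implementation. It rests on several assumptions that the review does not state: a uniform layer mapping g(m) = m * (N / M), mean squared error for the embedding, attention, and hidden-state terms, projection matrices (`w_embed`, `w_hidden`) that lift the student's narrower representations to the teacher's width, and a soft cross-entropy with a temperature for the prediction layer. All function names are hypothetical, and nested Python lists stand in for tensors.

```python
import math

def layer_map(m, n_student, n_teacher):
    """Assumed uniform g(m): student layer m learns from teacher layer m * (N / M)."""
    if n_teacher % n_student:
        raise ValueError("n_teacher must be a multiple of n_student")
    return m * (n_teacher // n_student)

def _flatten(x):
    """Yield every scalar in an arbitrarily nested list."""
    if isinstance(x, (int, float)):
        yield float(x)
    else:
        for item in x:
            yield from _flatten(item)

def mse(a, b):
    """Mean squared error between two equally shaped nested lists."""
    fa, fb = list(_flatten(a)), list(_flatten(b))
    if len(fa) != len(fb):
        raise ValueError("shape mismatch")
    return sum((x - y) ** 2 for x, y in zip(fa, fb)) / len(fa)

def project(h, w):
    """Matrix product h @ w; w has shape (student width, teacher width)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*w)] for row in h]

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    log_norm = top + math.log(sum(math.exp(z - top) for z in scaled))
    return [z - log_norm for z in scaled]

def prediction_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft cross-entropy: teacher probabilities against student log-probabilities."""
    log_p_student = log_softmax(student_logits, temperature)
    p_teacher = [math.exp(v) for v in log_softmax(teacher_logits, temperature)]
    return -sum(p * lp for p, lp in zip(p_teacher, log_p_student))

def transformer_distill_loss(student, teacher, w_embed, w_hidden):
    """Sum the embedding, attention, and hidden-state losses over mapped layers.

    student / teacher are dicts with "embeddings" (seq x width) plus
    "attentions" and "hiddens", each a list with one entry per layer.
    """
    n_s, n_t = len(student["hiddens"]), len(teacher["hiddens"])
    loss = mse(project(student["embeddings"], w_embed), teacher["embeddings"])
    for m in range(1, n_s + 1):
        t = layer_map(m, n_s, n_t)  # teacher layer paired with student layer m
        loss += mse(student["attentions"][m - 1], teacher["attentions"][t - 1])
        loss += mse(project(student["hiddens"][m - 1], w_hidden),
                    teacher["hiddens"][t - 1])
    return loss
```

Under the assumed mapping, a 4-layer student paired with a 12-layer teacher gives `layer_map(1, 4, 12) == 3` and `layer_map(4, 4, 12) == 12`. `prediction_loss` is kept separate from `transformer_distill_loss` because the review describes prediction-layer distillation as an addition to the three Transformer-layer terms. In a real training setup the projection matrices would be learned together with the student rather than passed in as fixed values.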

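Experimental results

Research questions

RQ1: Can Transformer-specific knowledge distillation effectively transfer knowledge from BERT to a smaller student?
RQ2: Does the two-stage distillation framework (general distillation and task-specific distillation) improve TinyBERT's performance over a single-stage approach?
RQ3: How do distillation at the embedding-layer, attention, and hidden-state levels each contribute to final performance?
RQ4: What are the trade-offs among parameter count, FLOPs, and inference speed when compressing BERT into TinyBERT?
RQ5: How close can a 4-layer or 6-layer TinyBERT get to BERT BASE on GLUE tasks?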

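Key findings

TinyBERT 4 (the 4-layer model) achieves more than 96.8% of BERT BASE's performance on GLUE while being about 7.5x smaller and about 9.4x faster at inference.
TinyBERT 6 (the 6-layer model) performs on par with BERT BASE on GLUE.
TinyBERT 4 outperforms the 4-layer KD baselines (BERT-PKD, DistilBERT 4) by at least 4.4% on average.
TinyBERT 4 achieves strong results with only about 13.3% of the parameters and about 10.6% of the inference time of BERT BASE.
Two-stage learning (general distillation plus task-specific distillation with data augmentation) is important for performance.
Attention-based distillation yields a notable improvement, and combining it with hidden-state distillation is complementary.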
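
The "7.5x smaller, 9.4x faster" figures and the "13.3% of parameters, 10.6% of inference time" figures describe the same ratios from two directions. A small check, using only the numbers quoted in the findings (the helper name is hypothetical), shows that they agree:

```python
def remaining_percent(factor):
    """Convert an 'N times smaller or faster' factor into the remaining share, in percent."""
    return 100.0 / factor

size_pct = remaining_percent(7.5)  # parameters relative to BERT BASE
time_pct = remaining_percent(9.4)  # inference time relative to BERT BASE

print(f"{size_pct:.1f}% of parameters, {time_pct:.1f}% of inference time")
# prints: 13.3% of parameters, 10.6% of inference time
```

The 28% and 31% figures in the abstract are measured against the 4-layer baselines, not against BERT BASE, so they are not expected to match these values.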

This review was generated by AI and checked by a human editor.