QUICK REVIEW

[Paper Review] Distilling Task-Specific Knowledge from BERT into Simple Neural Networks

Raphael Tang, Yao Lu | arXiv (Cornell University) | Mar 28, 2019

Topic Modeling | References: 36 | Citations: 337

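One-Sentence Summary

This study distills task-specific knowledge from BERT into a single-layer BiLSTM (a siamese BiLSTM for sentence pairs), substantially reducing the parameter count and speeding up inference while achieving performance close to ELMo's.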

ABSTRACT

In the natural language processing literature, neural networks are becoming increasingly deeper and complex. The recent poster child of this trend is the deep language representation model, which includes BERT, ELMo, and GPT. These developments have led to the conviction that previous-generation, shallower neural networks for language understanding are obsolete. In this paper, however, we demonstrate that rudimentary, lightweight neural networks can still be made competitive without architecture changes, external training data, or additional input features. We propose to distill knowledge from BERT, a state-of-the-art language representation model, into a single-layer BiLSTM, as well as its siamese counterpart for sentence-pair tasks. Across multiple datasets in paraphrasing, natural language inference, and sentiment classification, we achieve comparable results with ELMo, while using roughly 100 times fewer parameters and 15 times less inference time.

Motivation and Objectives

Question whether simple architectures can compete with deep transformers for NLP tasks.
Demonstrate knowledge transfer from BERT to a lightweight BiLSTM via distillation.
Show effectiveness of a rule-based data augmentation approach for distillation in NLP.

Proposed Method

Use BERT as teacher to guide a single-layer BiLSTM student through logit/soft target distillation.
Apply a distillation loss that minimizes the MSE between teacher and student logits (L_distill).
Combine distillation loss with cross-entropy, controlled by a mixing parameter alpha (L = alpha*L_CE + (1-alpha)*L_distill).
Construct a transfer dataset using a rule-based data augmentation strategy (masking, POS-guided replacement, n-gram sampling).
For sentence-pair tasks, employ a siamese BiLSTM with a concatenate–compare classifier.
Report results on GLUE tasks SST-2, MNLI, QQP to compare with ELMo and BERT baselines.
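
The combined objective in the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the default alpha of 0.5, and the choice to average the squared error over all logit entries (rather than sum it per example) are assumptions made here for readability.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(student_logits, labels):
    """Mean cross-entropy between student predictions and gold labels (L_CE)."""
    probs = softmax(student_logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())


def logit_mse(student_logits, teacher_logits):
    """Mean squared error between student and teacher logits (L_distill).

    Assumption: averaged over every logit entry; a per-example sum would
    differ only by a constant factor.
    """
    return float(np.mean((student_logits - teacher_logits) ** 2))


def distillation_objective(student_logits, teacher_logits, labels, alpha=0.5):
    """L = alpha * L_CE + (1 - alpha) * L_distill."""
    l_ce = cross_entropy(student_logits, labels)
    l_distill = logit_mse(student_logits, teacher_logits)
    return alpha * l_ce + (1.0 - alpha) * l_distill


# Hypothetical logits for a batch of two examples and two classes.
student = np.array([[0.2, -0.1], [1.5, 0.3]])
teacher = np.array([[1.0, -1.0], [2.0, -0.5]])
gold = np.array([0, 0])
loss = distillation_objective(student, teacher, gold, alpha=0.5)
```

Setting alpha to 0 reduces the objective to pure logit matching, which needs no gold labels; setting it to 1 recovers ordinary supervised training.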

Experimental Results

Research Questions

RQ1: Can a shallow BiLSTM model reach competitive performance with a BERT teacher through knowledge distillation?
RQ2: How does logit-level distillation compare to standard supervised training for a small student?
RQ3: Does rule-based data augmentation improve distillation effectiveness in NLP tasks?
RQ4: What are the trade-offs in accuracy versus parameter count and inference speed when distilling BERT into a BiLSTM?
RQ5: How does the distilled BiLSTM fare against ELMo and transformer baselines on GLUE tasks?

Key Findings

Distilled BiLSTM with soft targets closely matches ELMo-level performance on SST-2 and QQP and improves MNLI over a non-distilled BiLSTM.
The distilled BiLSTM achieves comparable results to ELMo while using roughly 100x fewer parameters and 15x faster inference for single-sentence tasks.
On MNLI, the distilled BiLSTM improves over the base BiLSTM by 4.3 points and beats some prior BiLSTM results, though it remains behind the BERT-LARGE and ELMo baselines.
The approach yields 2.2e6 parameters for the 300-unit BiLSTM variant and shows substantial efficiency gains compared to BERT-LARGE and ELMo in inference speed.
The siamese BiLSTM for sentence-pair tasks runs in time linear in sentence length, because it encodes each sentence separately and avoids pairwise word interactions.
Overall, shallow BiLSTMs with distillation are competitive with two implementations of ELMo and offer strong efficiency advantages; they do not surpass deep transformer models on average.
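
The siamese design mentioned in the findings can be illustrated with a small NumPy sketch. Several things here are assumptions rather than details given in this review: the feature layout [h1, h2, h1 * h2, |h1 - h2|] is one common form of a concatenate-compare step, and `encode` is a deliberately simple stand-in (a shared projection followed by max-pooling), not a BiLSTM. The point it shows is structural: each sentence is encoded on its own with shared weights, so no token of one sentence is ever compared against a token of the other.

```python
import numpy as np


def encode(token_vectors, weights):
    """Stand-in sentence encoder (NOT a BiLSTM): one pass over the tokens.

    Each token is projected with the shared weights and the results are
    max-pooled, so the cost grows linearly with the number of tokens.
    """
    projected = np.tanh(token_vectors @ weights)  # (num_tokens, hidden)
    return projected.max(axis=0)                  # (hidden,)


def concatenate_compare(h1, h2):
    """Combine two sentence vectors into a single feature vector.

    Assumed form: [h1, h2, h1 * h2, |h1 - h2|].
    """
    return np.concatenate([h1, h2, h1 * h2, np.abs(h1 - h2)])


def siamese_features(sentence_a, sentence_b, weights):
    """Encode both sentences with the SAME weights, then compare them."""
    h_a = encode(sentence_a, weights)
    h_b = encode(sentence_b, weights)
    return concatenate_compare(h_a, h_b)


# Hypothetical sizes and random inputs, for illustration only.
rng = np.random.default_rng(0)
embed_dim, hidden = 8, 4
shared_weights = rng.normal(size=(embed_dim, hidden))
sentence_a = rng.normal(size=(5, embed_dim))  # 5 tokens
sentence_b = rng.normal(size=(9, embed_dim))  # 9 tokens
features = siamese_features(sentence_a, sentence_b, shared_weights)
print(features.shape)  # (16,)
```

The resulting vector has four times the hidden size and would be passed to a classifier. The comparison happens only between the two fixed-size sentence vectors, which is why the work scales with the sum of the sentence lengths rather than their product.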

This review was generated by AI and checked by a human editor.