QUICK REVIEW

[Paper Review] FastBERT: a Self-distilling BERT with Adaptive Inference Time

Weijie Liu, Peng Zhou|arXiv (Cornell University)|Apr 5, 2020
Topic Modeling | References: 29 | Citations: 57
One-Line Summary

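FastBERT introduces a sample-wise adaptive inference mechanism and self-distillation within a single framework, speeding up BERT-like models while maintaining accuracy. Depending on the chosen speed-accuracy trade-off, it achieves speedups of 1x to 12x.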

ABSTRACT

Pre-trained language models like BERT have proven to be highly performant. However, they are often computationally expensive in many practical scenarios, for such heavy models can hardly be readily implemented with limited resources. To improve their efficiency with an assured model performance, we propose a novel speed-tunable FastBERT with adaptive inference time. The speed at inference can be flexibly adjusted under varying demands, while redundant calculation of samples is avoided. Moreover, this model adopts a unique self-distillation mechanism at fine-tuning, further enabling a greater computational efficacy with minimal loss in performance. Our model achieves promising results in twelve English and Chinese datasets. It is able to speed up by a wide range from 1 to 12 times than BERT if given different speedup thresholds to make a speed-performance tradeoff.

Motivation and Objectives

  • Motivate: reduce inference cost of BERT in industrial settings with varying request loads.
  • Propose: a speed-tunable BERT variant (FastBERT) with sample-wise adaptive inference and a self-distillation training regime.
  • Demonstrate: FastBERT achieves substantial speedups (1–12x) with minimal accuracy loss on twelve English/Chinese NLP tasks.
  • Showcase: compatibility with existing BERT-like models and practical deployment benefits.

Proposed Method

  • Backbone: a 12-layer Transformer encoder with a teacher classifier.
  • Branches: lightweight student classifiers attached to each Transformer output for early exits.
  • Training: three stages, namely backbone pre-training, backbone fine-tuning, and self-distillation of the student branches, which minimizes the KL divergence between each branch's output and the teacher's output.
  • Self-distillation: the teacher's soft labels supervise all student branches within the same model. Because the supervision comes from the teacher rather than from gold labels, unlabeled data can be used for distillation.
  • Adaptive inference: at each layer, compute the normalized entropy (Uncertainty) of the student output; a sample exits early as soon as its Uncertainty falls below a user-chosen threshold (Speed).
  • Uncertainty-Speed rule: under the LUHA hypothesis (the Lower the Uncertainty, the Higher the Accuracy), low-Uncertainty predictions are more likely to be correct, and a higher Speed threshold lets more samples exit early, so overall inference is faster.
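
The exit rule and the distillation objective described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`normalized_entropy`, `kl_divergence`, `adaptive_exit`) are hypothetical, the KL direction (student relative to teacher) is an assumption since the review only says "KL-divergence to the teacher outputs", and the per-layer outputs are passed in precomputed, whereas a real model would stop computing layers once a sample exits.

```python
import math


def normalized_entropy(probs):
    """Entropy of a class-probability vector divided by log(N).

    The result lies in [0, 1]: 0 for a one-hot prediction,
    1 for a uniform one. This plays the role of "Uncertainty".
    """
    n = len(probs)
    entropy = sum(-p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(n)


def kl_divergence(p_student, p_teacher, eps=1e-12):
    """KL(student || teacher); the direction is an assumption of this sketch.

    eps guards against division by zero when the teacher assigns
    zero probability to a class.
    """
    return sum(
        ps * math.log((ps + eps) / (pt + eps))
        for ps, pt in zip(p_student, p_teacher)
        if ps > 0.0
    )


def adaptive_exit(branch_outputs, speed):
    """Return (exit_layer, probs) for a single sample.

    branch_outputs: per-layer probability vectors ordered bottom to top;
    the last entry stands in for the teacher classifier.
    speed: the Uncertainty threshold; higher values exit earlier.
    """
    for layer, probs in enumerate(branch_outputs, start=1):
        if normalized_entropy(probs) < speed:
            return layer, probs
    # No branch was confident enough: fall back to the final classifier.
    return len(branch_outputs), branch_outputs[-1]


if __name__ == "__main__":
    # Made-up outputs of three classifiers for one binary-classification sample.
    outputs = [[0.6, 0.4], [0.95, 0.05], [0.99, 0.01]]
    for speed in (0.05, 0.1, 0.5):
        layer, _ = adaptive_exit(outputs, speed)
        print(f"Speed={speed}: exits at layer {layer}")
```

During self-distillation, a training loop would sum `kl_divergence` over all student branches for each sample and backpropagate only into the branch classifiers; that loop is omitted here because the review does not describe it in detail.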

Experimental Results

Research Questions

  • RQ1: Does introducing sample-wise adaptive inference reduce computation with minimal accuracy loss compared to BERT?
  • RQ2: Can self-distillation within a single model improve student-branch performance without external teacher models?
  • RQ3: How does the speed-accuracy trade-off behave across diverse English and Chinese NLP tasks?
  • RQ4: Is the LUHA hypothesis validated across layers and datasets?

Key Findings

  • FastBERT achieves 2–5x speedups with negligible accuracy loss on most datasets at Speed=0.1.
  • When allowing larger accuracy loss, FastBERT can reach 7–11x speedups over BERT.
  • The model demonstrates speedups ranging from 1x to 12x depending on the chosen Speed threshold, with competitive accuracy.
  • Adaptive inference significantly reduces FLOPs by moving easier samples to early exits, as shown by layer-exit distributions.
  • Self-distillation enables a set of lightweight student classifiers to approach teacher performance, while reducing overall FLOPs during inference.
  • LUHA hypothesis is empirically validated: lower uncertainty correlates with higher accuracy across bottom, middle, and top classifiers.
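
To see how a layer-exit distribution translates into a speedup, the following back-of-the-envelope sketch may help. It assumes that cost is proportional to the number of Transformer layers a sample passes through and ignores the cost of the branch classifiers themselves, so it is a rough approximation rather than the FLOPs measurement used in the paper. The function name `estimated_speedup` and the example distribution are hypothetical and are not taken from the paper's results.

```python
def estimated_speedup(exit_fractions):
    """Rough speedup over a model that always runs every layer.

    exit_fractions[i] is the fraction of samples that exit after
    layer i + 1; the fractions must sum to 1.
    Assumption: cost is proportional to layers executed, and the
    branch classifiers are treated as free.
    """
    if abs(sum(exit_fractions) - 1.0) > 1e-9:
        raise ValueError("exit fractions must sum to 1")
    total_layers = len(exit_fractions)
    expected_layers = sum(
        (i + 1) * fraction for i, fraction in enumerate(exit_fractions)
    )
    return total_layers / expected_layers


if __name__ == "__main__":
    # Made-up distribution for a 12-layer model: half of the samples exit
    # after layer 1, a quarter after layer 2, and a quarter after layer 3.
    made_up = [0.5, 0.25, 0.25] + [0.0] * 9
    print(f"estimated speedup: {estimated_speedup(made_up):.2f}x")
```

In this simplified model the two extremes match the reported range: if every sample runs all 12 layers the estimate is 1x, and if every sample exits after the first layer it is 12x.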

This review was generated by AI and checked by a human editor.