QUICK REVIEW

[Paper Review] BERT-of-Theseus: Compressing BERT by Progressive Module Replacing

Canwen Xu, Wangchunshu Zhou|arXiv (Cornell University)|Feb 7, 2020
Topic Modeling | References: 44 | Cited by: 34
ONE-LINE SUMMARY

BERT-of-Theseus progressively replaces BERT modules with smaller substitutes during training, achieving about 1.94x speed-up while retaining over 98% of BERT-base performance on GLUE, without extra distillation losses.

ABSTRACT

In this paper, we propose a novel model compression approach to effectively compress BERT by progressive module replacing. Our approach first divides the original BERT into several modules and builds their compact substitutes. Then, we randomly replace the original modules with their substitutes to train the compact modules to mimic the behavior of the original modules. We progressively increase the probability of replacement through the training. In this way, our approach brings a deeper level of interaction between the original and compact models. Compared to the previous knowledge distillation approaches for BERT compression, our approach does not introduce any additional loss function. Our approach outperforms existing knowledge distillation approaches on GLUE benchmark, showing a new perspective of model compression.

MOTIVATION AND OBJECTIVES

  • Reduce the size and compute cost of large Transformer models such as BERT without relying on additional distillation losses.
  • Introduce a progressive module replacement (Theseus Compression) framework that alternates between predecessor and successor modules during training.
  • Demonstrate that the resulting compact model maintains near-original performance on GLUE while offering speed-ups.
  • Show that curriculum-based replacement scheduling yields better results than constant replacement rates.
  • Provide analysis on which layers and replacement strategies most impact performance.

PROPOSED METHOD

  • Partition the original BERT into modules and define compact substitute modules for each.
  • During training, replace predecessor modules with successor modules with a probability p, mixing both in a single forward pass.
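  • Optimize only the task-specific loss (e.g., cross-entropy) while keeping the predecessor's weights frozen, including its embedding and output layers, so that gradients flow through both predecessor and successor modules but update only the successors.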
  • After convergence, assemble all successor modules into a full successor model and fine-tune it with the same loss.
  • Apply a Curriculum Learning-driven replacement scheduler to progressively increase the replacement probability over training steps.
  • Use a simple linear scheduler p_d = min(1, kt + b), where t is the training step, k is the slope, and b is the starting replacement rate, to control the dynamic replacement rate and warm up learning.
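
The replacement procedure and linear scheduler described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `replacement_prob` and `mixed_forward`, the toy modules, and the values of `k` and `b` are all hypothetical, and real predecessor and successor modules would be groups of Transformer layers rather than simple functions.

```python
import random

def replacement_prob(step, k, b):
    """Linear curriculum scheduler: p_d = min(1, k * t + b)."""
    return min(1.0, k * step + b)

def mixed_forward(x, predecessors, successors, p, rng=random):
    """Run one forward pass through a mixed stack of modules.

    For each module slot, an independent Bernoulli(p) draw decides whether
    the successor (compact) or the predecessor (original) module is used.
    """
    for prd, scc in zip(predecessors, successors):
        module = scc if rng.random() < p else prd
        x = module(x)
    return x

# Toy stand-ins (assumption): six module slots, each just adds a constant.
predecessors = [lambda x: x + 2 for _ in range(6)]  # frozen originals
successors = [lambda x: x + 1 for _ in range(6)]    # compact substitutes

k, b = 0.001, 0.3  # hypothetical slope and starting replacement rate
for step in (0, 200, 400, 1000):
    p = replacement_prob(step, k, b)
    y = mixed_forward(0, predecessors, successors, p)
    print(f"step={step} p={p:.2f} output={y}")
```

Once `p` reaches 1, every slot uses its successor, so the forward pass is that of the assembled successor model. In a real training loop, only the successor parameters would receive gradient updates.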

EXPERIMENTAL RESULTS

Research Questions

  • RQ1: Can progressively replacing modules within a large pretrained model yield effective compression without extra distillation objectives?
  • RQ2: Does curriculum-based scheduling improve the trade-off between compression and accuracy compared to constant replacement rates?
  • RQ3: How does Theseus Compression compare to KD-based baselines on GLUE in terms of performance, speed, and model size?
  • RQ4: Is the approach agnostic to the choice of Transformer-based architecture, and does it have the potential to apply to other domains?

Key Findings

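Method          | CoLA | MNLI | MRPC | QNLI | QQP  | RTE  | SST-2 | STS-B | Macro
BERT-base       | 54.3 | 83.5 | 89.5 | 91.2 | 89.8 | 71.1 | 91.5  | 88.9  | 82.5
DistilBERT      | 43.6 | 79.0 | 87.5 | 85.3 | 84.9 | 59.9 | 90.7  | 81.2  | 76.5
Vanilla KD      | 45.1 | 80.1 | 86.2 | 88.0 | 88.1 | 64.9 | 90.5  | 84.9  | 78.5
BERT-PKD        | 45.5 | 81.3 | 85.7 | 88.4 | 88.4 | 66.5 | 91.3  | 86.2  | 79.2
BERT-of-Theseus | 51.1 | 82.3 | 89.0 | 89.5 | 89.6 | 68.2 | 91.5  | 88.7  | 81.2
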
  • BERT-of-Theseus achieves 1.94x inference speed-up with 6-layer compressed models while retaining 98.4% (development) and 98.3% (test) of BERT-base performance on GLUE.
  • The method outperforms vanilla KD and PKD baselines on GLUE across most tasks.
  • A curriculum replacement scheduler consistently improves performance over constant replacement and anti-curriculum strategies.
  • Replacing earlier Transformer layers tends to hurt performance more than replacing later layers, indicating early layers contribute more to linguistic features.
  • Intermediate-task transfer (MNLI as a pretraining task) shows competitive or superior results on several tasks compared to DistilBERT and PD-BERT.
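
The retention figure can be checked against the Macro column of the results table. The short sketch below divides each method's macro score by that of BERT-base; the helper name `retention` is hypothetical, and the scores are copied from the table in this section.

```python
# Macro scores copied from the results table in this review.
macro = {
    "BERT-base": 82.5,
    "DistilBERT": 76.5,
    "Vanilla KD": 78.5,
    "BERT-PKD": 79.2,
    "BERT-of-Theseus": 81.2,
}

def retention(scores, baseline="BERT-base"):
    """Return each method's macro score as a percentage of the baseline's."""
    base = scores[baseline]
    return {name: round(100 * s / base, 1) for name, s in scores.items()}

for name, pct in retention(macro).items():
    print(f"{name}: {pct}% of BERT-base")
```

With these numbers BERT-of-Theseus comes out at 98.4% of BERT-base, which agrees with the development-set retention quoted in the findings.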


This review was generated by AI and checked by a human editor.