QUICK REVIEW

[Paper Review] BinaryBERT: Pushing the Limit of BERT Quantization

Haoli Bai, Wei Zhang | arXiv (Cornell University) | Dec 31, 2020

Topic Modeling | References: 69 | Citations: 45

One-Line Summary

BinaryBERT binarizes the weights of BERT, using ternary weight splitting to initialize the binary model before fine-tuning it, and reaches a model about 24x smaller with almost no loss of accuracy on GLUE and SQuAD.

ABSTRACT

The rapid development of large pre-trained language models has greatly increased the demand for model compression techniques, among which quantization is a popular solution. In this paper, we propose BinaryBERT, which pushes BERT quantization to the limit by weight binarization. We find that a binary BERT is harder to train directly than a ternary counterpart due to its complex and irregular loss landscape. Therefore, we propose ternary weight splitting, which initializes BinaryBERT by equivalently splitting from a half-sized ternary network. The binary model thus inherits the good performance of the ternary one, and can be further enhanced by fine-tuning the new architecture after splitting. Empirical results show that our BinaryBERT has only a slight performance drop compared with the full-precision model while being 24x smaller, achieving the state-of-the-art compression results on the GLUE and SQuAD benchmarks.

Motivation and Objectives

Motivate model compression for large pre-trained language models to enable edge deployment.
Investigate the feasibility and challenges of weight binarization for BERT.
Propose a training workflow that bridges binary and full-precision models to preserve performance.
Provide an adaptive splitting strategy to tailor binary model size to device constraints.

Proposed Method

Analyze loss landscapes of full-precision, ternary, and binary BERT to identify optimization challenges.
Introduce ternary weight splitting (TWS) to initialize BinaryBERT from a half-width ternary model with a splitting equivalence.
Quantize activations and apply layer- or row-wise ternarization with separate scaling per split matrix after splitting.
Incorporate knowledge distillation from a full-precision teacher, covering intermediate-layer distillation, prediction-layer distillation, and the fine-tuning stage.
Enable adaptive splitting to select which modules are ternary versus binary under resource constraints.
Formulate adaptive splitting as a gain-maximization problem: choose the modules to split so that performance is as high as possible under size/FLOP limits.
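
The splitting equivalence behind TWS can be illustrated with a small NumPy sketch. This is a simplified illustration, not the paper's exact construction: it splits only the quantized ternary values (the paper also splits the latent full-precision weights), the TWN-style quantizer with a 0.7 threshold ratio is an assumed choice, and the function names `ternarize` and `split_ternary` are hypothetical. Widening is modeled here by concatenating the two binary matrices along the input dimension and duplicating the input.

```python
import numpy as np


def ternarize(w, threshold_ratio=0.7):
    """TWN-style ternarization (assumed quantizer): values in {-alpha, 0, +alpha}."""
    delta = threshold_ratio * np.mean(np.abs(w))
    mask = np.abs(w) > delta
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask, alpha


def split_ternary(w_ternary, alpha):
    """Split a ternary matrix into two binary matrices that sum back to it.

    Both outputs take values in {-alpha/2, +alpha/2}:
      +alpha -> (+alpha/2, +alpha/2)
      -alpha -> (-alpha/2, -alpha/2)
       0     -> (+alpha/2, -alpha/2)
    """
    half = alpha / 2.0
    t = np.sign(w_ternary)
    b1 = np.where(t == 0, 1.0, t) * half
    b2 = np.where(t == 0, -1.0, t) * half
    return b1, b2


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 6))   # toy full-precision weight matrix
x = rng.normal(size=6)        # toy input vector

w_t, alpha = ternarize(w)
b1, b2 = split_ternary(w_t, alpha)

# Output of the narrow ternary layer.
y_ternary = w_t @ x

# Output of the widened binary layer: twice as many input columns,
# fed with the duplicated input.
w_wide = np.concatenate([b1, b2], axis=1)
y_binary = w_wide @ np.concatenate([x, x])

print(np.allclose(y_ternary, y_binary))
```

Because the two binary matrices sum to the ternary one, the widened binary layer starts from the same function as the ternary layer, which is the property that lets the binary model inherit the ternary model's performance before fine-tuning.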

Experimental Results

Research Questions

RQ1: Can binary weight quantization of BERT achieve acceptable performance relative to full-precision or ternary models?
RQ2: What mechanisms underlie the performance drop observed when moving from ternary to binary weights?
RQ3: Can a splitting-based training workflow (ternary weight splitting) initialize and fine-tune a binary BERT effectively?
RQ4: Does adaptive splitting yield better trade-offs between model size, FLOPs, and accuracy under edge-device constraints?

Key Findings

BinaryBERT achieves a small performance gap to full-precision BERT on GLUE and SQuAD while being 24x smaller.
Direct binary training suffers from a steep, irregular loss landscape compared to full-precision and ternary models.
Ternary weight splitting (TWS) initializes BinaryBERT from a half-width ternary model and preserves its performance after splitting.
Adaptive splitting further improves results across model sizes by selecting the most quantization-sensitive modules to be ternary before splitting to binary.
Across GLUE and SQuAD, BinaryBERT with splitting outperforms other binarization methods in most cases, especially with 4-bit activations.
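
Selecting which modules to make ternary under a size budget can be viewed as a knapsack-style problem. The sketch below is one possible way to express that idea, under stated assumptions: each module has a pre-measured accuracy gain and an integer cost, and a generic 0/1 knapsack dynamic program picks the subset. The function name `select_modules`, the gain and cost values, and the choice of solver are all hypothetical and are not taken from the paper.

```python
def select_modules(gains, costs, budget):
    """Pick modules maximizing total gain with total cost <= budget.

    gains:  per-module benefit of keeping the module ternary (floats, assumed measured)
    costs:  per-module extra cost as non-negative integers (e.g. size units)
    budget: total extra cost allowed (non-negative integer)
    Returns (sorted list of chosen module indices, total gain).
    """
    n = len(gains)
    # best[i][b] = best gain using the first i modules with cost at most b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        g, c = gains[i - 1], costs[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]
            if c <= b and best[i - 1][b - c] + g > best[i][b]:
                best[i][b] = best[i - 1][b - c] + g

    # Walk back through the table to recover which modules were chosen.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= costs[i - 1]
    return sorted(chosen), best[n][budget]


# Toy numbers for illustration only.
example_gains = [0.8, 0.5, 0.3, 0.9]
example_costs = [4, 2, 1, 5]
chosen, total_gain = select_modules(example_gains, example_costs, budget=7)
print(chosen, total_gain)
```

An exact dynamic program is practical here because the number of candidate modules in a Transformer is small; a greedy gain-per-cost ranking would be a cheaper but approximate alternative.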

This review was generated by AI and checked by a human editor.