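[Paper Review] BinaryBERT: Pushing the Limit of BERT Quantization
BinaryBERT binarizes the weights of BERT, using ternary weight splitting to initialize and then fine-tune the binary model. It reaches a roughly 24x smaller model size with almost no loss of accuracy on GLUE and SQuAD.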
The rapid development of large pre-trained language models has greatly increased the demand for model compression techniques, among which quantization is a popular solution. In this paper, we propose BinaryBERT, which pushes BERT quantization to the limit by weight binarization. We find that a binary BERT is harder to train directly than a ternary counterpart due to its complex and irregular loss landscape. Therefore, we propose ternary weight splitting, which initializes BinaryBERT by equivalently splitting from a half-sized ternary network. The binary model thus inherits the good performance of the ternary one, and can be further enhanced by fine-tuning the new architecture after splitting. Empirical results show that our BinaryBERT has only a slight performance drop compared with the full-precision model while being 24x smaller, achieving the state-of-the-art compression results on the GLUE and SQuAD benchmarks.
Research Motivation and Objectives
- Motivate model compression for large pre-trained language models to enable edge deployment.
- Investigate the feasibility and challenges of weight binarization for BERT.
- Propose a training workflow that bridges binary and full-precision models to preserve performance.
- Provide an adaptive splitting strategy to tailor binary model size to device constraints.
Proposed Method
- Analyze loss landscapes of full-precision, ternary, and binary BERT to identify optimization challenges.
- Introduce ternary weight splitting (TWS) to initialize BinaryBERT from a half-width ternary model with a splitting equivalence.
- Quantize activations and apply layer- or row-wise ternarization with separate scaling per split matrix after splitting.
- Incorporate knowledge distillation from a full-precision teacher during intermediate, prediction, and fine-tuning stages.
- Enable adaptive splitting to select which modules are ternary versus binary under resource constraints.
- Frame adaptive splitting as a gain-maximizing selection: choose the modules to split so that performance is as high as possible within the size/FLOPs limit.
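The splitting equivalence in the list above can be illustrated with a minimal sketch. It assumes a TWN-style ternarizer (threshold at 0.7 times the mean absolute weight) and a single shared scale of alpha/2 for both binary halves. It reproduces only the quantized values and does not split the latent full-precision weights, which the paper's construction also handles. The function names `ternarize`, `split_ternary`, and `dot` are hypothetical and not taken from the paper's code.

```python
def ternarize(weights, threshold_ratio=0.7):
    """TWN-style ternarization (an assumption): returns (alpha, codes) with codes in {-1, 0, 1}."""
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = threshold_ratio * mean_abs
    kept = [abs(w) for w in weights if abs(w) > delta]
    alpha = sum(kept) / len(kept) if kept else 0.0
    codes = [1 if w > delta else -1 if w < -delta else 0 for w in weights]
    return alpha, codes


def split_ternary(alpha, codes):
    """Split alpha * codes into two binary vectors, each scaled by alpha / 2.

    +1 -> (+1, +1), 0 -> (+1, -1), -1 -> (-1, -1), so the two halves sum
    back to the ternary value in every position.
    """
    half = alpha / 2.0
    first = [1 if c >= 0 else -1 for c in codes]
    second = [1 if c > 0 else -1 for c in codes]
    return half, first, second


def dot(ws, xs):
    """Plain dot product, standing in for one output unit of a linear layer."""
    return sum(w * x for w, x in zip(ws, xs))


# Hypothetical weights and inputs, for illustration only.
weights = [0.9, -0.05, -1.1, 0.3, 0.02, -0.7]
inputs = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0]

alpha, codes = ternarize(weights)
half, first, second = split_ternary(alpha, codes)

# Half-width ternary unit versus the double-width binary unit built from it.
ternary_out = dot([alpha * c for c in codes], inputs)
binary_out = dot([half * b for b in first + second], inputs + inputs)
print(ternary_out, binary_out)
```

Duplicating the input and concatenating the two binary halves doubles the width of the unit while leaving its output unchanged up to floating-point rounding. This is the sense in which a binary model can start from the ternary model's behavior before fine-tuning.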
Experimental Results
Research Questions
- RQ1: Can binary weight quantization of BERT achieve acceptable performance relative to full-precision or ternary models?
- RQ2: What mechanisms underlie the performance drop observed when moving from ternary to binary weights?
- RQ3: Can a splitting-based training workflow (ternary weight splitting) initialize and fine-tune a binary BERT effectively?
- RQ4: Does adaptive splitting yield better trade-offs between model size, FLOPs, and accuracy under edge-device constraints?
Key Findings
- BinaryBERT achieves a small performance gap to full-precision BERT on GLUE and SQuAD while being 24x smaller.
- Direct binary training suffers from a steep, irregular loss landscape compared to full-precision and ternary models.
- Ternary weight splitting (TWS) initializes BinaryBERT from a half-width ternary model and preserves its performance after splitting.
- Adaptive splitting further improves results across model sizes by selecting the most quantization-sensitive modules to be ternary before splitting to binary.
- Across GLUE and SQuAD, BinaryBERT with splitting outperforms other binarization methods in most cases, especially with 4-bit activations.
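The adaptive splitting idea can be sketched as a greedy selection under a budget. This is one plausible reading of a gain-first strategy, not the paper's exact procedure. The `Candidate` class, the module names, and every gain and cost value below are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    gain: float  # hypothetical accuracy gain from keeping this module ternary before splitting
    cost: float  # hypothetical extra size (MB) the module adds once it is split


def select_by_gain(candidates, budget):
    """Greedily pick the highest-gain modules whose total cost fits the budget."""
    chosen, used = [], 0.0
    for cand in sorted(candidates, key=lambda c: c.gain, reverse=True):
        if used + cand.cost <= budget:
            chosen.append(cand.name)
            used += cand.cost
    return chosen, used


# Made-up sensitivity and cost numbers, not measurements from the paper.
candidates = [
    Candidate("layer1.ffn", gain=0.8, cost=2.0),
    Candidate("layer1.attn", gain=0.5, cost=1.0),
    Candidate("layer2.ffn", gain=0.3, cost=2.0),
    Candidate("layer2.attn", gain=0.6, cost=1.0),
]

chosen, used = select_by_gain(candidates, budget=3.5)
print(chosen, used)
```

Modules left unselected would stay binary at their original width, so the budget directly controls how much of the model is widened by splitting. Sorting by `cost` instead of `gain` would give a cheapest-first variant of the same loop.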
This review was generated by AI and checked by a human editor.