QUICK REVIEW

[Paper Review] MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices

Zhiqing Sun, Hongkun Yu|arXiv (Cornell University)|Apr 6, 2020

Topic Modeling|References: 50|Cited by: 90

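One-Sentence Summary

MobileBERT is a compact, task-agnostic BERT variant designed for mobile devices; it is 4.3x smaller and 5.5x faster than BERT-BASE while remaining competitive on the GLUE and SQuAD benchmarks.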

ABSTRACT

Natural Language Processing (NLP) has recently achieved great success by using huge pre-trained models with hundreds of millions of parameters. However, these models suffer from heavy model sizes and high latency such that they cannot be deployed to resource-limited mobile devices. In this paper, we propose MobileBERT for compressing and accelerating the popular BERT model. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied to various downstream NLP tasks via simple fine-tuning. Basically, MobileBERT is a thin version of BERT_LARGE, while equipped with bottleneck structures and a carefully designed balance between self-attentions and feed-forward networks. To train MobileBERT, we first train a specially designed teacher model, an inverted-bottleneck incorporated BERT_LARGE model. Then, we conduct knowledge transfer from this teacher to MobileBERT. Empirical studies show that MobileBERT is 4.3x smaller and 5.5x faster than BERT_BASE while achieving competitive results on well-known benchmarks. On the natural language inference tasks of GLUE, MobileBERT achieves a GLUE score of 77.7 (0.6 lower than BERT_BASE), and 62 ms latency on a Pixel 4 phone. On the SQuAD v1.1/v2.0 question answering task, MobileBERT achieves a dev F1 score of 90.0/79.2 (1.5/2.1 higher than BERT_BASE).

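Motivation and Objectives

Motivate and enable the deployment of BERT-like models on resource-limited devices.
Design a deep, thin Transformer variant that preserves performance through bottleneck structures.
Develop teacher-student knowledge transfer that trains the thin MobileBERT from an inverted-bottleneck teacher.
Optimize runtime behavior to reduce inference latency on mobile devices.
Demonstrate task-agnostic fine-tuning on standard NLP benchmarks.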

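Proposed Method

Introduce bottleneck and inverted-bottleneck blocks that narrow the model's width while preserving its depth.
Use 4 stacked FFNs in each layer to rebalance the parameter split between MHA and FFN.
Train a deep teacher, IB-BERT-LARGE, with a feature-map size of 512, and distill it layer by layer into MobileBERT.
Use feature map transfer and attention transfer as the layer-wise knowledge transfer objectives.
Combine the MLM, NSP, and KD losses for pre-training distillation.
Explore three training strategies: Auxiliary, Joint, and Progressive Knowledge Transfer; the progressive strategy trains the layers one at a time and can optionally fine-tune the lower layers.
Embedding factorization: reduce the embedding dimension to 128 and apply a 1D convolution to recover a 512-dimensional output.
Runtime optimizations: replace LayerNorm with NoNorm and gelu with ReLU to reduce latency.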
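
The two layer-wise objectives named in this list, feature map transfer and attention transfer, can be sketched in a few lines. This is a minimal pure-Python illustration, assuming feature map transfer is the mean squared error between teacher and student feature maps of the same shape, and attention transfer is the KL divergence from the teacher's to the student's attention distributions, averaged over heads and query positions. The function names, the nested-list tensor layout, and the `eps` smoothing constant are assumptions made for illustration, not the authors' implementation.

```python
import math


def feature_map_transfer_loss(teacher, student):
    """Mean squared error between two feature maps laid out as [T][N].

    T is the sequence length and N the feature-map size; both maps
    are assumed to have identical shapes.
    """
    seq_len = len(teacher)
    hidden = len(teacher[0])
    total = 0.0
    for t_row, s_row in zip(teacher, student):
        for t_val, s_val in zip(t_row, s_row):
            total += (t_val - s_val) ** 2
    return total / (seq_len * hidden)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists.

    eps is an assumed smoothing constant that avoids log(0).
    """
    return sum(
        p_i * math.log((p_i + eps) / (q_i + eps))
        for p_i, q_i in zip(p, q)
        if p_i > 0.0
    )


def attention_transfer_loss(teacher_attn, student_attn):
    """Average KL(teacher || student) over heads and query positions.

    Attention is laid out as [A][T][T']: A heads, T query positions,
    and each innermost row is a distribution over key positions.
    """
    num_heads = len(teacher_attn)
    num_queries = len(teacher_attn[0])
    total = 0.0
    for t_head, s_head in zip(teacher_attn, student_attn):
        for t_row, s_row in zip(t_head, s_head):
            total += kl_divergence(t_row, s_row)
    return total / (num_heads * num_queries)


# Toy inputs (hypothetical values): 2 tokens, feature-map size 2, 1 head.
teacher_map = [[1.0, 2.0], [3.0, 4.0]]
student_map = [[1.0, 2.0], [3.0, 6.0]]
teacher_heads = [[[0.5, 0.5], [0.2, 0.8]]]
student_heads = [[[0.9, 0.1], [0.2, 0.8]]]

fmt_loss = feature_map_transfer_loss(teacher_map, student_map)
at_loss = attention_transfer_loss(teacher_heads, student_heads)
print(fmt_loss, at_loss)
```

Because the teacher's feature-map size matches the student's (512), the two feature maps can be compared element by element without an extra projection layer, which is what the sketch relies on.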

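Experimental Results

Research Questions

RQ1: Can a deep, thin BERT-like model remain competitive on standard NLP benchmarks when it is trained through layer-wise knowledge transfer from a teacher with inverted bottlenecks?
RQ2: Which training strategies and architectural choices give the best trade-off among accuracy, model size, and mobile latency in a task-agnostic compressed BERT?
RQ3: How do embedding factorization and runtime optimizations affect accuracy and real-world latency on mobile devices?
RQ4: How closely does MobileBERT approach BERT-BASE's performance on GLUE and SQuAD while delivering a significant speedup?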

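Key Findings

MobileBERT is 4.3x smaller than BERT-BASE and runs inference 5.5x faster.
On GLUE, MobileBERT achieves a GLUE score of 77.7, only 0.6 lower than BERT-BASE, with 62 ms latency on a Pixel 4.
On SQuAD v1.1/v2.0, MobileBERT achieves dev F1 scores of 90.0/79.2, which are 1.5/2.1 points higher than BERT-BASE.
MobileBERT-TINY and the quantized variants shrink the model further with almost no loss in accuracy; post-training quantization provides additional compression with minimal degradation.
The runtime optimizations (NoNorm and ReLU) substantially reduce real-world latency without reducing FLOPs, highlighting the gap between FLOPs and real-world latency.
Progressive knowledge transfer consistently outperforms the auxiliary and joint strategies, improving results on both GLUE and SQuAD.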
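
To make the runtime finding concrete, the sketch below contrasts LayerNorm with NoNorm, and gelu with ReLU, on a single hidden vector. It assumes NoNorm is the element-wise linear transform γ ∘ h + β, which drops the per-vector mean and variance statistics that LayerNorm computes. The function names, the `eps` value, and the erf-based form of gelu are assumptions chosen for illustration.

```python
import math


def layer_norm(h, gamma, beta, eps=1e-12):
    """Standard layer normalization over one hidden vector.

    Needs a mean, a variance, and a square root per vector before
    the element-wise scale (gamma) and shift (beta) are applied.
    """
    n = len(h)
    mean = sum(h) / n
    var = sum((x - mean) ** 2 for x in h) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (x - mean) * inv_std + b for x, g, b in zip(h, gamma, beta)]


def no_norm(h, gamma, beta):
    """NoNorm sketch: element-wise scale and shift, with no statistics."""
    return [g * x + b for x, g, b in zip(h, gamma, beta)]


def gelu(x):
    """Exact (erf-based) Gaussian error linear unit."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    """Rectified linear unit: a single comparison per element."""
    return max(0.0, x)


# Toy hidden vector and parameters (hypothetical values).
hidden = [1.0, 2.0, 3.0]
gamma = [2.0, 2.0, 2.0]
beta = [1.0, 1.0, 1.0]

print(layer_norm(hidden, gamma, beta))
print(no_norm(hidden, gamma, beta))
print([gelu(x) for x in hidden], [relu(x) for x in hidden])
```

In this sketch NoNorm skips the mean, variance, and square root that LayerNorm needs, and ReLU avoids evaluating erf. Operations of this kind add little to the FLOP count yet can still be slow on mobile hardware, which is consistent with the gap between FLOPs and measured latency noted above.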

This review was generated by AI and reviewed by a human editor.