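[Paper Explained] Model compression via distillation and quantization
This paper introduces two methods, quantized distillation and differentiable quantization, that compress deep networks by distilling a full-precision teacher model into a shallower, quantized student model. Both preserve accuracy well while achieving substantial compression across vision and language tasks.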
Deep neural networks (DNNs) continue to make significant advances, solving tasks from image classification to translation or reinforcement learning. One aspect of the field receiving considerable attention is efficiently executing deep models in resource-constrained environments, such as mobile or embedded devices. This paper focuses on this problem, and proposes two new compression methods, which jointly leverage weight quantization and distillation of larger teacher networks into smaller student networks. The first method we propose is called quantized distillation and leverages distillation during the training process, by incorporating distillation loss, expressed with respect to the teacher, into the training of a student network whose weights are quantized to a limited set of levels. The second method, differentiable quantization, optimizes the location of quantization points through stochastic gradient descent, to better fit the behavior of the teacher model. We validate both methods through experiments on convolutional and recurrent architectures. We show that quantized shallow students can reach similar accuracy levels to full-precision teacher models, while providing order of magnitude compression, and inference speedup that is linear in the depth reduction. In sum, our results enable DNNs for resource-constrained environments to leverage architecture and accuracy advances developed on more powerful devices.
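Research Motivation and Goals
- Use a high-accuracy, full-precision teacher to improve the performance of the compressed student model.
- Combine distillation with weight quantization to reduce both depth and width.
- Validate the methods on CNNs, RNNs, and translation tasks to demonstrate generality and practical benefit.
- Measure the compression and speedup on standard benchmarks, while quantizing without sacrificing accuracy.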
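Proposed Methods
- Define weight quantization in terms of scaling, bucketing, and uniform or non-uniform schemes.
- Introduce quantized distillation, which trains a student model with quantized weights using a distillation loss.
- Develop differentiable quantization, which learns the quantization points p by backpropagating through the quantization function with SGD.
- Apply the methods to CNNs (such as ResNet variants), Wide ResNets, LSTMs in OpenNMT, and a WMT translation setup.
- Analyze the storage savings and inference speedup from compression, including bucketing and Huffman-coded representations.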
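To make the first two items concrete, here is a minimal NumPy sketch of bucketed uniform quantization and one quantized-distillation update. It is an illustration under stated assumptions, not the authors' implementation: the function names, the deterministic rounding, the use of 2**bits levels per bucket, the linear softmax student, and the `temperature` and `hard_weight` parameters are all hypothetical choices made for the example. The step reflects the general idea of running the forward pass with quantized weights while applying the resulting gradient to a full-precision copy.

```python
import numpy as np

def quantize_uniform(weights, bits=4, bucket_size=256):
    """Snap each bucket of weights onto 2**bits evenly spaced levels."""
    flat = np.asarray(weights, dtype=np.float64).ravel()
    steps = 2 ** bits - 1                      # intervals between adjacent levels
    out = np.empty_like(flat)
    for start in range(0, flat.size, bucket_size):
        bucket = flat[start:start + bucket_size]
        beta = bucket.min()                    # per-bucket offset
        alpha = bucket.max() - beta            # per-bucket scale
        if alpha == 0.0:                       # constant bucket: nothing to quantize
            out[start:start + bucket_size] = bucket
            continue
        scaled = (bucket - beta) / alpha       # map the bucket to [0, 1]
        snapped = np.round(scaled * steps) / steps
        out[start:start + bucket_size] = snapped * alpha + beta
    return out.reshape(np.shape(weights))

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, hard_weight=0.5):
    """Weighted sum of hard-label cross-entropy and soft-target cross-entropy."""
    n = student_logits.shape[0]
    probs = softmax(student_logits)
    hard = -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))
    teacher_soft = softmax(teacher_logits / temperature)
    student_soft = softmax(student_logits / temperature)
    soft = -np.mean(np.sum(teacher_soft * np.log(student_soft + 1e-12), axis=1))
    return hard_weight * hard + (1.0 - hard_weight) * soft

def quantized_distillation_step(w_full, x, teacher_logits, labels, lr=0.1,
                                bits=4, temperature=2.0, hard_weight=0.5):
    """One update of a linear student: forward with quantized weights,
    gradient applied to the full-precision weights."""
    w_q = quantize_uniform(w_full, bits=bits)
    logits = x @ w_q
    n = x.shape[0]
    onehot = np.eye(logits.shape[1])[labels]
    grad_hard = (softmax(logits) - onehot) / n
    grad_soft = (softmax(logits / temperature)
                 - softmax(teacher_logits / temperature)) / (temperature * n)
    grad_logits = hard_weight * grad_hard + (1.0 - hard_weight) * grad_soft
    grad_w = x.T @ grad_logits                 # gradient w.r.t. the quantized weights
    return w_full - lr * grad_w                # update the full-precision copy
```

In a full training loop one would call `quantized_distillation_step` repeatedly and quantize the weights once more at the end, so that the deployed student holds only quantized values plus the per-bucket `alpha` and `beta`.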
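Experimental Results
Research Questions
- RQ1: Can distillation combined with quantization produce accurate compressed models suitable for resource-constrained environments?
- RQ2: How do quantized distillation and differentiable quantization compare in accuracy, convergence, and efficiency across vision and language tasks?
- RQ3: How do bit width, bucket size, and architecture affect the trade-off between compression and accuracy?
- RQ4: When training quantized models, does the distillation loss outperform the standard loss?
- RQ5: Do these methods scale to large datasets and architectures (such as ImageNet and WMT)?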
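The storage side of the trade-off in RQ3 can be estimated with simple arithmetic. The sketch below assumes that each weight costs `bits` bits and that every bucket additionally stores two full-precision values (a scale and an offset). That accounting, the function names, and the 32-bit default are assumptions for illustration; any further gain from Huffman coding is ignored.

```python
import math

def quantized_size_bits(num_weights, bits, bucket_size, float_bits=32):
    """Bits for the quantized weights plus two floats of overhead per bucket."""
    num_buckets = math.ceil(num_weights / bucket_size)
    return num_weights * bits + 2 * float_bits * num_buckets

def compression_ratio(num_weights, bits, bucket_size, float_bits=32):
    """Full-precision size divided by the quantized size."""
    full_size = num_weights * float_bits
    return full_size / quantized_size_bits(num_weights, bits, bucket_size, float_bits)
```

Under these assumptions, smaller buckets track the local weight range more closely but pay more overhead: for 1024 weights at 4 bits, a bucket size of 256 gives roughly 7.5x compression, while a bucket size of 64 gives 6.4x.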
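Key Findings
- Quantized shallow students can approach the accuracy of full-precision teachers while achieving order-of-magnitude compression.
- At 2-bit and 4-bit settings, quantized distillation outperforms both quantization applied after training and differentiable quantization on most tasks.
- On ImageNet, a 4-bit quantized, distilled 2xResNet18 reaches accuracy comparable to its ResNet34 teacher while being smaller and faster.
- On CIFAR-10, both differentiable quantization and quantized distillation come close to the teacher's accuracy at 4 bits, with the distillation loss providing the larger gain.
- The OpenNMT and WMT experiments show that distillation helps keep BLEU and perplexity close to teacher levels at reduced model size.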
This explainer was generated by AI and reviewed by a human editor.