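[Paper Summary] Training wide residual networks for deployment using a single bit for each weight
This paper deploys wide residual networks with 1 bit per weight by applying a fixed per-layer weight scaling during training. It achieves strong accuracy on CIFAR-10/100, SVHN, ImageNet32, and ImageNet, competitive with full-precision baselines.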
For fast and energy-efficient deployment of trained deep neural networks on resource-constrained embedded hardware, each learned weight parameter should ideally be represented and stored using a single bit. Error-rates usually increase when this requirement is imposed. Here, we report large improvements in error rates on multiple datasets, for deep convolutional neural networks deployed with 1-bit-per-weight. Using wide residual networks as our main baseline, our approach simplifies existing methods that binarize weights by applying the sign function in training; we apply scaling factors for each layer with constant unlearned values equal to the layer-specific standard deviations used for initialization. For CIFAR-10, CIFAR-100 and ImageNet, and models with 1-bit-per-weight requiring less than 10 MB of parameter memory, we achieve error rates of 3.9%, 18.5% and 26.0% / 8.5% (Top-1 / Top-5) respectively. We also considered MNIST, SVHN and ImageNet32, achieving 1-bit-per-weight test results of 0.27%, 1.9%, and 41.3% / 19.1% respectively. For CIFAR, our error rates halve previously reported values, and are within about 1% of our error-rates for the same network with full-precision weights. For networks that overfit, we also show significant improvements in error rate by not learning batch normalization scale and offset parameters. This applies to both full precision and 1-bit-per-weight networks. Using a warm-restart learning-rate schedule, we found that training for 1-bit-per-weight is just as fast as full-precision networks, with better accuracy than standard schedules, and achieved about 98%-99% of peak performance in just 62 training epochs for CIFAR-10/100. For full training code and trained models in MATLAB, Keras and PyTorch see https://github.com/McDonnell-Lab/1-bit-per-weight/ .
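Research motivation and goals
- Sharply reduce memory and energy costs at deployment by using 1-bit-per-weight networks, with as little loss of accuracy as possible.
- Develop a training strategy that keeps training efficient while allowing 1-bit weights at inference.
- Explore simple per-layer scaling to keep gradients and activations correctly scaled during training.
- Study the effect of not learning the batch normalization scale and offset parameters on datasets prone to overfitting.
- Provide practical training and architecture adjustments that yield competitive results with 1-bit weights.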
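The memory argument behind these goals comes down to simple arithmetic, sketched below. The parameter count is a hypothetical round number chosen for illustration, not a figure from the paper, and the helper name `param_memory_mb` is likewise an assumption. One MB is taken as 10^6 bytes.

```python
def param_memory_mb(num_params: int, bits_per_weight: int) -> float:
    """Return parameter storage in megabytes (1 MB = 10**6 bytes)."""
    total_bits = num_params * bits_per_weight
    return total_bits / 8 / 1e6

# Hypothetical parameter count, for illustration only (not from the paper).
NUM_PARAMS = 25_000_000

full_precision_mb = param_memory_mb(NUM_PARAMS, 32)  # 32-bit floating point
one_bit_mb = param_memory_mb(NUM_PARAMS, 1)          # 1 bit per weight
print(full_precision_mb, one_bit_mb)
```

Whatever the parameter count, storing 1 bit instead of 32 bits per weight shrinks parameter memory by a factor of 32.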
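Proposed method
- Use wide residual networks as the baseline and, during training, apply a simple fixed scaling to the sign of each convolutional layer's weights.
- Train with SGD, applying updates to full-precision weights, while forward and backward propagation use the sign of the weights scaled by a layer-specific constant equal to that layer's He-initialization standard deviation.
- Replace the final weight layer with a 1x1 convolutional layer followed by batch normalization and global average pooling, which enables 1-bit storage at deployment.
- Use a warm-restart learning-rate schedule to speed up convergence of 1-bit-per-weight models.
- Do not learn the batch normalization scale and offset parameters on CIFAR-10/100 and SVHN, to reduce overfitting on these datasets.
- Optionally apply cutout data augmentation on CIFAR-10/100 to improve accuracy.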
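The core training idea (full-precision weights for updates, scaled signs for the forward and backward passes) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the function names, the toy weights and gradients, the learning rate, and the choice to map sign(0) to +1 are all hypothetical. The scale uses the usual He-initialization standard deviation, sqrt(2 / fan_in), with fan_in taken as kernel height times kernel width times input channels.

```python
import math

def he_std(kernel_size: int, in_channels: int) -> float:
    """He-initialization standard deviation for a conv layer: sqrt(2 / fan_in)."""
    fan_in = kernel_size * kernel_size * in_channels
    return math.sqrt(2.0 / fan_in)

def binarize(weights, scale):
    """Weights used in the forward/backward pass: fixed scale times sign(w).
    Assumption: sign(0) is mapped to +1 so every weight fits in one bit."""
    return [scale if w >= 0 else -scale for w in weights]

def sgd_step(shadow_weights, grads_wrt_binary, lr):
    """Update the full-precision copy using gradients computed for the
    binarized weights (a straight-through treatment of sign, assumed here)."""
    return [w - lr * g for w, g in zip(shadow_weights, grads_wrt_binary)]

scale = he_std(kernel_size=3, in_channels=16)  # constant, never learned
shadow = [0.02, -0.11, 0.0, -0.004]            # full-precision training copy
w_bin = binarize(shadow, scale)                # used in forward and backward passes
grads = [0.5, -0.5, 1.0, -1.0]                 # made-up gradients w.r.t. w_bin
shadow = sgd_step(shadow, grads, lr=0.01)      # the update hits the full-precision copy

# At deployment only the sign bit of each weight needs to be stored.
deploy_bits = [1 if w >= 0 else 0 for w in shadow]
```

Because the scale is a per-layer constant rather than a learned or per-channel value, it does not need to be stored alongside each weight, which keeps the deployed layer at 1 bit per weight.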
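Experimental results
Research questions
- RQ1: How close can 1-bit-per-weight networks come to full-precision residual networks on standard vision benchmarks?
- RQ2: Can simple per-layer weight scaling enable effective 1-bit training and inference with minimal hardware complexity?
- RQ3: How does not learning the batch normalization scale and offset parameters affect model performance and overfitting?
- RQ4: Can a warm-restart learning-rate schedule speed up training of 1-bit-per-weight networks to near full-precision performance?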
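For readers unfamiliar with warm restarts (RQ4), here is a minimal sketch of one common form: cosine annealing within each cycle, with the learning rate reset to its maximum at the start of the next cycle. The summary above only states that a warm-restart schedule was used, so the cosine shape, the first cycle length of 2 epochs, the doubling of cycle length, and the learning-rate bounds are all assumptions made for illustration.

```python
import math

def warm_restart_lr(epoch, lr_max=0.1, lr_min=0.0, first_period=2, period_mult=2):
    """Cosine-annealed learning rate with warm restarts.
    All default values are illustrative assumptions."""
    period, start = first_period, 0
    while epoch >= start + period:       # find the cycle containing this epoch
        start += period
        period *= period_mult
    progress = (epoch - start) / period  # 0 at a restart, approaches 1 at cycle end
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def cycle_end_epochs(num_cycles, first_period=2, period_mult=2):
    """Epoch counts at which each cycle ends and the next restart begins."""
    ends, total, period = [], 0, first_period
    for _ in range(num_cycles):
        total += period
        ends.append(total)
        period *= period_mult
    return ends

print(cycle_end_epochs(5))  # [2, 6, 14, 30, 62]
```

Under these assumed cycle lengths, the fifth cycle ends after 2 + 4 + 8 + 16 + 32 = 62 epochs, which is consistent with the 62-epoch figure quoted in the abstract.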
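Key findings
- On CIFAR-10/100, 1-bit-per-weight Wide ResNets reach error rates well below those of earlier 1-bit methods, roughly halving previously reported values.
- On CIFAR, the 20-10 1-bit network reaches an error rate of 4.72% on CIFAR-10 and 19.35% on CIFAR-100, compared with 4.22% and 18.76% at full precision.
- Combining cutout data augmentation with 1-bit weights improves the 20-10 results on CIFAR-10/100 to 3.92% and 18.51% respectively.
- On ImageNet32, the 1-bit-per-weight model reaches 41.26% / 19.08% (Top-1 / Top-5) error. On full ImageNet with multi-crop testing it reaches 26.04% / 8.48% (Top-1 / Top-5), which is competitive for the given memory budget.
- The gap between full-precision and 1-bit-per-weight accuracy tends to widen as the full-precision error rate rises, but it narrows when starting from a stronger full-precision baseline and using the proposed scaling method.
- Not learning the batch normalization scale and offset parameters gives significant improvements on datasets prone to overfitting (such as CIFAR-10/100 and SVHN), especially when combined with warm restarts.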
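To illustrate what "not learning the scale and offset" means in the last finding, the sketch below contrasts batch normalization without affine parameters against the standard form. It handles a single channel in plain Python; the function names, the toy activations, and the `eps` value are assumptions for illustration only.

```python
import math

def batch_norm_no_affine(values, eps=1e-5):
    """Normalize one channel's batch of activations to zero mean and unit
    variance, with no learnable scale (gamma) or offset (beta)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased batch variance
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm_affine(values, gamma, beta, eps=1e-5):
    """Standard form for comparison: gamma and beta would normally be trained."""
    return [gamma * v + beta for v in batch_norm_no_affine(values, eps)]

activations = [1.0, 2.0, 3.0, 4.0]  # made-up activations for a single channel
normalized = batch_norm_no_affine(activations)
```

In the frameworks named in the abstract, the same effect is available through existing options: `affine=False` on `torch.nn.BatchNorm2d` in PyTorch, or `center=False, scale=False` on the Keras `BatchNormalization` layer.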
This summary was generated by AI and reviewed by a human editor.