QUICK REVIEW

[Paper Review] Quantizing deep convolutional networks for efficient inference: A whitepaper

Raghuraman Krishnamoorthi|arXiv (Cornell University)|Jun 21, 2018

Advanced Neural Network Applications|Cited by 756

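One-Sentence Summary

This whitepaper reviews post-training quantization and quantization-aware training for CNN weights and activations at 8-, 4-, and 16-bit precision, analyzes the accuracy impact across architectures, and provides TensorFlow tools and training best practices for efficient inference on edge devices.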

ABSTRACT

We present an overview of techniques for quantizing convolutional neural networks for inference with integer weights and activations. Per-channel quantization of weights and per-layer quantization of activations to 8-bits of precision post-training produces classification accuracies within 2% of floating point networks for a wide variety of CNN architectures. Model sizes can be reduced by a factor of 4 by quantizing weights to 8-bits, even when 8-bit arithmetic is not supported. This can be achieved with simple, post training quantization of weights. We benchmark latencies of quantized networks on CPUs and DSPs and observe a speedup of 2x-3x for quantized implementations compared to floating point on CPUs. Speedups of up to 10x are observed on specialized processors with fixed point SIMD capabilities, like the Qualcomm QDSPs with HVX. Quantization-aware training can provide further improvements, reducing the gap to floating point to 1% at 8-bit precision. Quantization-aware training also allows for reducing the precision of weights to four bits with accuracy losses ranging from 2% to 10%, with higher accuracy drop for smaller networks. We introduce tools in TensorFlow and TensorFlowLite for quantizing convolutional networks and review best practices for quantization-aware training to obtain high accuracy with quantized weights and activations. We recommend that per-channel quantization of weights and per-layer quantization of activations be the preferred quantization scheme for hardware acceleration and kernel optimization. We also propose that future processors and hardware accelerators for optimized inference support precisions of 4, 8 and 16 bits.

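Motivation and Objectives

Motivate quantization as a way to reduce model size, memory footprint, and power consumption for inference on edge devices.
Characterize the accuracy and latency trade-offs of 8-, 4-, and 16-bit quantization across a range of CNN architectures.
Develop and evaluate two approaches: post-training quantization and quantization-aware training.
Provide practical guidance and tools for implementing quantized models in TensorFlow and TensorFlow Lite.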

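Proposed Methods

Present uniform affine, uniform symmetric, and stochastic quantizers for weights and activations.
Derive the forward and backward pass formulas under simulated quantization with the straight-through estimator.
Evaluate post-training quantization (weight-only, and weights plus activations) and quantization-aware training on ImageNet for networks such as MobileNet, Inception, ResNet, and NasNet.
Analyze how quantization granularity (per-layer vs. per-channel) affects accuracy.
Discuss batch normalization handling and computation-friendly folding strategies for quantized inference.
Provide TensorFlow tools and a practical workflow for quantization and deployment.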
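
The uniform affine quantizer listed above can be illustrated with a minimal sketch. It assumes the common scale and zero-point formulation; the function names (affine_params, quantize, dequantize, fake_quantize) are hypothetical and are not taken from the paper's TensorFlow tooling. The symmetric and stochastic variants are not shown.

```python
def affine_params(x_min, x_max, num_bits=8):
    """Scale and zero-point for a uniform affine quantizer (illustrative)."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    q_max = 2 ** num_bits - 1
    scale = (x_max - x_min) / q_max
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero input; any positive scale works
    zero_point = round(-x_min / scale)  # integer level that represents 0.0
    return scale, zero_point


def quantize(x, scale, zero_point, num_bits=8):
    """Map a float to an unsigned integer level: round, shift, then clamp."""
    q = round(x / scale) + zero_point
    return max(0, min(2 ** num_bits - 1, q))


def dequantize(q, scale, zero_point):
    """Map an integer level back to its float value."""
    return (q - zero_point) * scale


def fake_quantize(values, num_bits=8):
    """Quantize then dequantize, as simulated quantization does in training."""
    scale, zp = affine_params(min(values), max(values), num_bits)
    return [dequantize(quantize(v, scale, zp, num_bits), scale, zp)
            for v in values]
```

With 8 bits and values spanning roughly [-1, 3], the step size is about 4/255, and every reconstructed value should land within one step of its input. In quantization-aware training, the straight-through estimator would treat the rounding in fake_quantize as the identity during the backward pass; that part is not modeled here.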

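Experimental Results

Research Questions

RQ1: Which quantization schemes (per-channel weights, per-layer activations) preserve near-floating-point accuracy across common CNN architectures?
RQ2: How does accuracy differ between post-training quantization and quantization-aware training at 8 bits and below?
RQ3: How does batch normalization handling affect the accuracy and stability of quantized CNNs?
RQ4: Which tools and training workflows enable practical deployment of quantized models in TensorFlow and TensorFlow Lite?
RQ5: What latency and model-size benefits do quantized networks deliver on CPUs, DSPs, and specialized accelerators?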

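Key Findings

Post-training per-channel weight quantization combined with per-layer activation quantization at 8-bit precision stays within 2% of floating-point accuracy across a wide variety of CNN architectures.
Quantizing weights to 8 bits with simple post-training quantization reduces model size by a factor of 4, even when 8-bit arithmetic is not supported.
Quantized networks run 2x-3x faster than floating point on CPUs, and up to 10x faster on specialized processors with fixed-point SIMD capabilities.
Quantization-aware training narrows the gap to floating point to 1% at 8-bit precision, and allows 4-bit weights with accuracy losses of 2% to 10%, with larger drops for smaller networks.
For weights, per-channel quantization generally outperforms per-layer quantization, especially at 4-bit precision with fine-tuning.
Quantizing activations to 8 bits adds almost no accuracy loss, which the review attributes to normalization strategies such as simplified BatchNorm or ReLU6.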
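
The granularity finding above can be made concrete with a small, hedged experiment. The sketch below uses synthetic weights, not the paper's data: three hypothetical output channels whose ranges differ by 10x each. It applies a symmetric quantizer with the restricted range [-q_max, q_max], which is an assumption of this sketch, and compares the mean squared reconstruction error when one scale is shared by the whole layer against one scale per channel.

```python
import random


def symmetric_scale(values, num_bits):
    """Step size of a symmetric quantizer (zero-point fixed at 0)."""
    q_max = 2 ** (num_bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(v) for v in values) / q_max
    return scale if scale > 0.0 else 1.0


def fake_quantize(values, scale, num_bits):
    """Round to the nearest level, clamp, and map back to floats."""
    q_max = 2 ** (num_bits - 1) - 1
    return [max(-q_max, min(q_max, round(v / scale))) * scale for v in values]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def per_layer_error(channels, num_bits):
    """One scale shared by every channel in the layer."""
    flat = [v for ch in channels for v in ch]
    scale = symmetric_scale(flat, num_bits)
    return mse(flat, fake_quantize(flat, scale, num_bits))


def per_channel_error(channels, num_bits):
    """A separate scale for each output channel."""
    flat = [v for ch in channels for v in ch]
    recon = []
    for ch in channels:
        recon.extend(fake_quantize(ch, symmetric_scale(ch, num_bits), num_bits))
    return mse(flat, recon)


rng = random.Random(0)  # fixed seed so the comparison is repeatable
# Synthetic, hypothetical weights: channel ranges of 0.01, 0.1, and 1.0.
channels = [[rng.uniform(-amp, amp) for _ in range(64)]
            for amp in (0.01, 0.1, 1.0)]

for bits in (8, 4):
    print(bits, per_layer_error(channels, bits), per_channel_error(channels, bits))
```

With a shared scale, the widest channel sets the step size, so the narrow channels are represented by only a few levels; per-channel scales avoid this. The model-size figure is separate arithmetic: a float32 weight takes 4 bytes and an 8-bit weight takes 1 byte, which gives the factor of 4.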

This review was generated by AI and checked by a human editor.