QUICK REVIEW

[Paper Review] HAQ: Hardware-Aware Automated Quantization with Mixed Precision

Kuan Wang, Zhijian Liu | arXiv (Cornell University) | Nov 21, 2018

Topic: Advanced Neural Network Applications | References: 29 | Cited by: 49

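One-Sentence Summary

HAQ uses reinforcement learning with hardware in the loop to automatically assign a mixed-precision bitwidth to each layer, optimizing latency, energy, and model size on edge and cloud accelerators. It produces quantization policies tailored to the target hardware with minimal accuracy loss.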

ABSTRACT

Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emergent DNN hardware accelerators begin to support mixed precision (1-8 bits) to further improve the computation efficiency, which raises a great challenge to find the optimal bitwidth for each layer: it requires domain experts to explore the vast design space trading off among accuracy, latency, energy, and model size, which is both time-consuming and sub-optimal. Conventional quantization algorithm ignores the different hardware architectures and quantizes all the layers in a uniform way. In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ) framework which leverages the reinforcement learning to automatically determine the quantization policy, and we take the hardware accelerator's feedback in the design loop. Rather than relying on proxy signals such as FLOPs and model size, we employ a hardware simulator to generate direct feedback signals (latency and energy) to the RL agent. Compared with conventional methods, our framework is fully automated and can specialize the quantization policy for different neural network architectures and hardware architectures. Our framework effectively reduced the latency by 1.4-1.95x and the energy consumption by 1.9x with negligible loss of accuracy compared with the fixed bitwidth (8 bits) quantization. Our framework reveals that the optimal policies on different hardware architectures (i.e., edge and cloud architectures) under different resource constraints (i.e., latency, energy and model size) are drastically different. We interpreted the implication of different quantization policies, which offer insights for both neural network architecture design and hardware architecture design.

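Research Motivation and Goals

Automatically search for per-layer mixed-precision quantization policies without hand-crafted heuristics.
Bring hardware feedback directly into the optimization loop so that real hardware metrics are optimized rather than proxies.
Demonstrate that quantization policies specialize across diverse hardware architectures (edge vs. cloud).
Provide insight into how different hardware characteristics shape the optimal quantization policy.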

Proposed Method

Formulates quantization as a reinforcement learning problem solved by a DDPG agent.
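Uses a continuous per-layer action space to select each layer's bitwidth; the continuous action is then rounded to a discrete integer bitwidth (2 to 8 bits in the paper's experiments).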
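Obtains latency and energy feedback directly from the hardware accelerator (through a hardware simulator) and uses it to enforce resource constraints during policy optimization.
Applies linear quantization with a per-layer bitwidth for weights and activations, with KL-divergence-based clipping for the weights.
Fine-tunes the quantized model for one epoch and uses the scaled validation accuracy as the RL reward.
Explores policies under multiple hardware settings (edge/cloud, temporal/spatial architectures) to learn specialized policies.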
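
The steps in this method list can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the integer bitwidth range of 2 to 8, the symmetric quantizer, the reward scale of 0.1, and the use of model size (parameters times bits) as the resource cost are all assumptions made for illustration. In HAQ the latency and energy costs come from a hardware simulator, and the clipping value is chosen by a KL-divergence criterion, which is simply passed in as `clip` here.

```python
import math

B_MIN, B_MAX = 2, 8  # assumed bitwidth range for this sketch


def action_to_bitwidth(action, b_min=B_MIN, b_max=B_MAX):
    """Map a continuous action in [0, 1] to an integer bitwidth."""
    raw = b_min - 0.5 + action * (b_max - b_min + 1)
    bits = int(math.floor(raw + 0.5))  # round half up
    return max(b_min, min(b_max, bits))  # clamp the edge case action == 1.0


def linear_quantize(values, bits, clip):
    """Symmetric linear quantization of floats to `bits` bits within [-clip, clip]."""
    if bits < 2 or clip <= 0:
        raise ValueError("this sketch needs bits >= 2 and clip > 0")
    levels = 2 ** (bits - 1) - 1      # number of positive integer levels
    scale = clip / levels             # step size between adjacent levels
    quantized = []
    for v in values:
        clipped = max(-clip, min(clip, v))
        quantized.append(round(clipped / scale) * scale)
    return quantized


def enforce_budget(bits, params, budget, b_min=B_MIN):
    """Lower bitwidths layer by layer until the (hypothetical) cost fits the budget.

    Cost here is model size in bits: sum(bitwidth * parameter count).
    """
    bits = list(bits)

    def cost():
        return sum(b * p for b, p in zip(bits, params))

    while cost() > budget:
        changed = False
        for i in range(len(bits)):
            if bits[i] > b_min:
                bits[i] -= 1
                changed = True
                if cost() <= budget:
                    break
        if not changed:  # every layer is already at b_min; budget is infeasible
            break
    return bits


def reward(acc_quant, acc_orig, scale=0.1):
    """Scaled accuracy difference used as the RL reward (scale is an assumption)."""
    return scale * (acc_quant - acc_orig)


# Example: three continuous actions become per-layer bitwidths.
policy = [action_to_bitwidth(a) for a in (0.0, 0.5, 1.0)]
print(policy)
```

Enforcing the budget by shrinking the action after the agent proposes it keeps the reward a function of accuracy only, so the agent never has to learn a penalty term for violating the constraint.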

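Experimental Results

Research Questions

RQ1: Can hardware-aware reinforcement learning automatically discover the optimal per-layer bitwidths for different hardware architectures?
RQ2: Does specializing the quantization policy to specific hardware bring significant latency and energy improvements with only a small accuracy loss?
RQ3: How do resource constraints (latency, energy, model size) affect the learned bitwidth allocation across layers?
RQ4: What insights for neural network and hardware design can be drawn from the policies learned on edge vs. cloud hardware and on different accelerator designs?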

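Key Findings

Compared with fixed 8-bit quantization, HAQ reduces latency by 1.4× to 1.95× and energy consumption by about 1.9×, with negligible accuracy loss.
The optimal quantization policies differ drastically across hardware architectures (edge vs. cloud, BISMO vs. BitFusion), which shows the need for hardware-specific optimization.
Depthwise and pointwise layers receive different bitwidth allocations depending on whether latency, energy, or model size is being optimized, reflecting memory and compute bottlenecks.
Compared with rule-based baselines (e.g., PACT, Deep Compression), HAQ achieves higher accuracy at similar or smaller model sizes under various constraints.
The learned policies are consistent with roofline-model reasoning, which attributes the per-layer differences to the target hardware's memory bandwidth and compute capacity.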
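
The roofline reasoning in the last finding can be made concrete with a short sketch. Under the roofline model, attainable throughput is the smaller of peak compute and memory bandwidth times operational intensity (operations per byte moved). The code below is illustrative only: the layer shapes, the stride-1 same-size-output simplification, and the hardware numbers (peak MAC rate and bandwidth) are made-up assumptions, not values from the paper.

```python
def conv_stats(h, w, c_in, c_out, k, depthwise, w_bits, a_bits):
    """Return (MACs, bytes moved) for one conv layer.

    Assumes stride 1 and an output feature map of the same spatial size.
    """
    if depthwise:
        c_out = c_in                       # one filter per input channel
        macs = h * w * c_in * k * k
        n_weights = c_in * k * k
    else:
        macs = h * w * c_in * c_out * k * k
        n_weights = c_in * c_out * k * k
    n_activations = h * w * (c_in + c_out)  # input plus output feature maps
    bytes_moved = (n_weights * w_bits + n_activations * a_bits) / 8
    return macs, bytes_moved


def roofline_bound(macs, bytes_moved, peak_macs_per_s, bandwidth_bytes_per_s):
    """Classify a layer as 'memory' or 'compute' bound under the roofline model."""
    intensity = macs / bytes_moved                      # MACs per byte
    memory_roof = bandwidth_bytes_per_s * intensity     # bandwidth-limited rate
    return "memory" if memory_roof < peak_macs_per_s else "compute"


# Hypothetical accelerator: 1e12 MAC/s peak, 2e10 bytes/s memory bandwidth.
PEAK, BANDWIDTH = 1e12, 2e10

depthwise = conv_stats(56, 56, 128, 128, 3, True, w_bits=8, a_bits=8)
pointwise = conv_stats(56, 56, 128, 128, 1, False, w_bits=8, a_bits=8)

print("depthwise:", roofline_bound(*depthwise, PEAK, BANDWIDTH))
print("pointwise:", roofline_bound(*pointwise, PEAK, BANDWIDTH))
```

With these assumed numbers the depthwise layer performs far fewer MACs per byte than the pointwise layer, so it falls under the memory roof. Reducing activation bits shrinks `bytes_moved`, which is one way to read why bitwidth allocations for the two layer types can diverge on bandwidth-limited hardware.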

This review was generated by AI and reviewed by human editors.