QUICK REVIEW

[Paper Review] Rethinking floating point for deep learning

Jeff Johnson|arXiv (Cornell University)|Nov 1, 2018

Neural Networks and Applications | References: 25 | Cited by: 104

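One-sentence summary

This paper proposes an 8-bit log float that combines Kulisch accumulation with ELMA, comes within 0.9% top-1 accuracy of float32 on ResNet-50 without retraining, and on a 28 nm ASIC process uses 0.96x the power (at 1.12x the area) of an 8/32-bit integer multiply-add.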

ABSTRACT

Reducing hardware overhead of neural networks for faster or lower power inference and training is an active area of research. Uniform quantization using integer multiply-add has been thoroughly investigated, which requires learning many quantization parameters, fine-tuning training or other prerequisites. Little effort is made to improve floating point relative to this baseline; it remains energy inefficient, and word size reduction yields drastic loss in needed dynamic range. We improve floating point to be more energy efficient than equivalent bit width integer hardware on a 28 nm ASIC process while retaining accuracy in 8 bits with a novel hybrid log multiply/linear add, Kulisch accumulation and tapered encodings from Gustafson's posit format. With no network retraining, and drop-in replacement of all math and float32 parameters via round-to-nearest-even only, this open-sourced 8-bit log float is within 0.9% top-1 and 0.2% top-5 accuracy of the original float32 ResNet-50 CNN model on ImageNet. Unlike int8 quantization, it is still a general purpose floating point arithmetic, interpretable out-of-the-box. Our 8/38-bit log float multiply-add is synthesized and power profiled at 28 nm at 0.96x the power and 1.12x the area of 8/32-bit integer multiply-add. In 16 bits, our log float multiply-add is 0.59x the power and 0.68x the area of IEEE 754 float16 fused multiply-add, maintaining the same significand precision and dynamic range, proving useful for training ASICs as well.
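To make the idea of Kulisch accumulation in the abstract concrete, here is a minimal sketch, not the paper's hardware design. It compares a dot product accumulated in float32, which rounds after every addition, with one accumulated exactly in a wide fixed-point register and rounded once at the end. The names `float32_dot` and `kulisch_dot` and the `frac_bits` parameter are hypothetical. A Python integer stands in for the accumulator, so unlike real hardware it has no fixed width.

```python
import struct
from fractions import Fraction


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def float32_dot(a, b) -> float:
    """Dot product with a float32 accumulator: one rounding per addition."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = f32(acc + f32(f32(x) * f32(y)))
    return acc


def kulisch_dot(a, b, frac_bits: int = 32) -> float:
    """Dot product with an exact fixed-point accumulator (Kulisch-style).

    Assumption: every product is a multiple of 2**-frac_bits, so it lands
    exactly on the accumulator grid. A Python int models the wide register.
    """
    scale = 1 << frac_bits
    acc = 0  # integer count of 2**-frac_bits units
    for x, y in zip(a, b):
        q = Fraction(x) * Fraction(y) * scale  # exact product, rescaled
        if q.denominator != 1:
            raise ValueError("product does not fit the fixed-point grid")
        acc += q.numerator  # exact integer addition, no rounding
    return acc / scale  # single rounding, at the very end


a = [2.0**12, 2.0**-12, -(2.0**12)]
b = [2.0**12, 2.0**-12, 2.0**12]
print(float32_dot(a, b))  # 0.0: the small term is lost to rounding
print(kulisch_dot(a, b))  # 2**-24: the exact answer survives
```

The inputs are chosen so that the large products cancel. The float32 accumulator has already rounded away the 2**-24 term by then, while the exact accumulator still holds it.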

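Motivation and goals

Investigate whether floating point can be made more energy efficient for neural networks without retraining or extensive quantization.
Develop a new arithmetic that keeps general-purpose floating point semantics while improving dynamic range and energy efficiency at small bit widths.
Evaluate the proposed arithmetic on a standard CNN and compare it against conventional integer-quantization baselines on ASIC and FPGA platforms.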

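Proposed method

Proposes an 8-bit log-domain representation inspired by the (N, s) posit format and Gustafson's tapered encodings, together with ELMA (exact log-linear multiply-add) implemented via Kulisch accumulation.
Combines log-domain multiplication with linear-domain accumulation for energy-efficient EMA/ELMA operations.
Replaces all math and float32 parameters as a drop-in substitution, converting via round-to-nearest-even only, with no network retraining.
Synthesizes the design on a 28 nm ASIC process, evaluates it with a 32x32 matrix multiply-accumulate and an 8/38-bit log multiply-add, and compares power and area against IEEE and other standard alternatives.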
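The method above can be sketched in a few lines under simplifying assumptions. A value is stored as a sign and a fixed-point base-2 exponent. Multiplication is an exact integer addition of exponents, and each product is converted to the linear domain through a small lookup table before being added to a fixed-point accumulator. This is not the paper's encoding: there is no posit-style tapering and no special-value handling, and here rounding is applied to the exponent rather than to the linear value. The bit widths `LOG_FRAC_BITS`, `LIN_FRAC_BITS` and `ACC_FRAC_BITS` and the names `encode` and `elma_dot` are all hypothetical, chosen only for illustration.

```python
import math

LOG_FRAC_BITS = 3   # hypothetical: fractional bits of the log-domain exponent
LIN_FRAC_BITS = 8   # hypothetical: precision of the log-to-linear table
ACC_FRAC_BITS = 24  # hypothetical: fractional bits of the accumulator

# Table of 2**(k / 2**LOG_FRAC_BITS) as fixed-point integers.
POW2_LUT = [
    round(2 ** (k / (1 << LOG_FRAC_BITS)) * (1 << LIN_FRAC_BITS))
    for k in range(1 << LOG_FRAC_BITS)
]


def encode(x: float):
    """Quantize x to (sign, e) with x ~= sign * 2**(e / 2**LOG_FRAC_BITS)."""
    if x == 0:
        return (0, 0)
    sign = 1 if x > 0 else -1
    # Python's round() breaks ties to even.
    e = round(math.log2(abs(x)) * (1 << LOG_FRAC_BITS))
    return (sign, e)


def elma_dot(a, b) -> float:
    """Log-domain multiply, linear-domain accumulate."""
    acc = 0  # fixed-point accumulator, in units of 2**-ACC_FRAC_BITS
    for (sa, ea), (sb, eb) in zip(a, b):
        sign = sa * sb
        if sign == 0:
            continue
        e = ea + eb  # multiply in the log domain: exact integer add
        int_part = e >> LOG_FRAC_BITS               # floor, also for e < 0
        frac_part = e & ((1 << LOG_FRAC_BITS) - 1)  # non-negative remainder
        mantissa = POW2_LUT[frac_part]              # approximate 2**fraction
        shift = ACC_FRAC_BITS - LIN_FRAC_BITS + int_part
        # Very small products are truncated by the right shift.
        term = mantissa << shift if shift >= 0 else mantissa >> -shift
        acc += sign * term  # accumulate in the linear domain
    return acc / (1 << ACC_FRAC_BITS)


a = [encode(v) for v in (1.0, 2.0, -0.5)]
b = [encode(v) for v in (4.0, 0.25, 2.0)]
print(elma_dot(a, b))  # inputs are powers of two, so this prints 3.5
```

With only three fractional exponent bits, a value such as 3.0 is stored as 2**(13/8), roughly 3.08. The approximation comes from quantizing the inputs and from the lookup table, not from the multiplication or the accumulation.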

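Experimental results

Research questions

RQ1: Can an 8-bit floating-point-like representation maintain near-float32 accuracy on CNNs without retraining?
RQ2: Do log-domain arithmetic and Kulisch accumulation offer energy and area advantages in hardware over conventional int8/32 quantization?
RQ3: What trade-offs in accuracy, latency and hardware resources come with using ELMA-based multiply-add in CNN workloads?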

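Key findings

With no retraining, the 8-bit log float with ELMA is within 0.9% top-1 and 0.2% top-5 accuracy of float32 for ResNet-50 on ImageNet.
On a 28 nm ASIC process, the 8/38-bit log multiply-add uses 0.96x the power and 1.12x the area of an 8/32-bit integer multiply-add.
In 16 bits, the log multiply-add uses 0.59x the power and 0.68x the area of an IEEE 754 float16 FMA while maintaining the same significand precision and dynamic range.
FPGA experiments show results competitive with int8/32 MAC for the baseline that combines ELMA with Kulisch accumulation.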

This review was generated by AI and checked by a human editor.