QUICK REVIEW

[Paper Review] Mixed Precision Training With 8-bit Floating Point

Naveen Mellempudi, Sudarshan Srinivasan|arXiv (Cornell University)|May 29, 2019

Robotic Mechanisms and Dynamics | References: 23 | Cited by: 41

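One-Sentence Summary

This paper trains with 8-bit floating point (FP8) for weights, activations, errors, and gradients, using 32-bit accumulators, and reports state-of-the-art accuracy across multiple models and tasks on Imagenet-1K and WMT16.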

ABSTRACT

Reduced precision computation for deep neural networks is one of the key areas addressing the widening compute gap driven by an exponential growth in model size. In recent years, deep learning training has largely migrated to 16-bit precision, with significant gains in performance and energy efficiency. However, attempts to train DNNs at 8-bit precision have met with significant challenges because of the higher precision and dynamic range requirements of back-propagation. In this paper, we propose a method to train deep neural networks using 8-bit floating point representation for weights, activations, errors, and gradients. In addition to reducing compute precision, we also reduced the precision requirements for the master copy of weights from 32-bit to 16-bit. We demonstrate state-of-the-art accuracy across multiple data sets (imagenet-1K, WMT16) and a broader set of workloads (Resnet-18/34/50, GNMT, Transformer) than previously reported. We propose an enhanced loss scaling method to augment the reduced subnormal range of 8-bit floating point for improved error propagation. We also examine the impact of quantization noise on generalization and propose a stochastic rounding technique to address gradient noise. As a result of applying all these techniques, we report slightly higher validation accuracy compared to full precision baseline.

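Motivation and Objectives

Advance reduced-precision training to address the widening compute gap in deep learning.
Propose FP8 compute for weights, activations, errors, and gradients in the critical compute path, without requiring stochastic rounding in that path.
Demonstrate FP8 training on large-scale datasets and models while also reducing the precision of the master copy of weights.
Address loss-scaling challenges and quantization noise to maintain or improve accuracy.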

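Proposed Method

Use FP8 (s=1, e=5, m=2) for weights, activations, errors, and gradients, paired with 32-bit floating point accumulators.
Insert quantization operations in the forward, backward, and weight-update paths to down-convert 32-bit outputs to FP8.
Apply loss scaling to prevent gradient underflow and keep optimization stable.
Store master weights in FP16; perform the update in FP32 on the compute path and write the result back to memory as FP16.
Study rounding modes and introduce stochastic rounding to reduce gradient noise and improve generalization.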
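The quantization step described above can be illustrated with a small NumPy sketch. It emulates a 1-5-2 format in software and supports both round-to-nearest-even and stochastic rounding. Several details are assumptions rather than facts from the paper: the exponent bias of 15, an IEEE-style layout in which the top exponent code is reserved (giving a largest finite value of 57344), saturation on overflow, and the function name `quantize_fp8` itself. Treat it as a way to build intuition, not as the authors' implementation.

```python
import numpy as np

MANT_BITS = 2                      # m = 2
EXP_BITS = 5                       # e = 5
BIAS = 15                          # assumed IEEE-style bias for 5 exponent bits
MIN_NORMAL_EXP = 1 - BIAS          # -14; subnormals share this exponent
# Assumed largest finite value: 1.75 * 2**15 = 57344 (top exponent code reserved).
MAX_VALUE = (2 - 2.0 ** -MANT_BITS) * 2.0 ** (2 ** EXP_BITS - 2 - BIAS)


def quantize_fp8(x, stochastic=False, rng=None):
    """Round values to the nearest point on the emulated FP8 (1-5-2) grid.

    Returns float32 values that are exactly representable in the emulated format.
    Overflow saturates to +/-MAX_VALUE (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=np.float64)
    sign = np.sign(x)
    mag = np.abs(x)

    # frexp gives mag = f * 2**e with f in [0.5, 1), so the binade exponent is e - 1.
    _, e = np.frexp(mag)
    exp = np.maximum(e - 1, MIN_NORMAL_EXP)   # clamp so subnormals use a fixed step
    step = 2.0 ** (exp - MANT_BITS)           # spacing of representable values

    scaled = mag / step
    if stochastic:
        if rng is None:
            rng = np.random.default_rng()
        lower = np.floor(scaled)
        frac = scaled - lower
        # Round up with probability equal to the fractional part (unbiased in expectation).
        rounded = lower + (rng.random(scaled.shape) < frac)
    else:
        rounded = np.rint(scaled)             # round half to even

    q = np.minimum(rounded * step, MAX_VALUE)
    return (sign * q).astype(np.float32)
```

With round-to-nearest-even, 1.2 maps to 1.25 and a value as small as 2^-17 flushes to zero, which is the kind of underflow that loss scaling is meant to avoid. With `stochastic=True`, a value such as 1.1 lands on 1.0 or 1.25 at random, and the average over many draws stays close to 1.1.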

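Experimental Results

Research Questions

RQ1: Can FP8 mixed-precision training match or exceed the FP32 baseline accuracy on convolutional architectures (ResNet variants) and NLP/Seq2Seq models?
RQ2: Which loss-scaling strategies and rounding methods best cope with FP8's reduced subnormal range during training?
RQ3: How does FP8 affect convergence, generalization, and memory efficiency on large-scale datasets such as Imagenet-1K and WMT16?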

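Key Findings

FP8 training with enhanced loss scaling reaches top-1 accuracy close to or slightly above the FP32 baseline for ResNet-18/34/50 on Imagenet-1K (FP8 vs FP32: 69.71 vs 69.23; 72.95 vs 72.96; 75.70 vs 75.47).
FP8 training with FP32 accumulators maintains stable convergence and accuracy on the ResNet workloads and on the WMT16 GNMT/Transformer translation tasks.
FP8 BLEU scores on WMT16 are comparable to the FP32 baseline (GNMT 24.6 vs 24.3; Transformer 23.0 vs 23.6).
Some models (such as GNMT) need FP8 with dynamic loss scaling to avoid divergence and improve generalization.
Stochastic rounding of activations/gradients improves generalization and yields slightly better validation performance than deterministic rounding.
An FP16 master copy combined with FP8 compute halves master-weight storage (a 50% reduction) without degrading accuracy.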
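The dynamic loss scaling mentioned in the findings can be sketched as a small state machine. This review does not spell out the schedule used by the paper's enhanced loss scaling, so the code below shows only the generic technique: back off the scale when a gradient overflows, grow it again after a run of clean steps. The class name `DynamicLossScaler`, its default values, and the `min_scale` floor are all hypothetical choices made for illustration.

```python
import math


class DynamicLossScaler:
    """Generic dynamic loss scaling (illustrative sketch, not the paper's exact scheme)."""

    def __init__(self, init_scale=2.0 ** 13, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=200, min_scale=1.0):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.min_scale = min_scale          # hypothetical lower bound on the scale
        self._good_steps = 0

    def scale_loss(self, loss):
        # Multiply the loss so that small gradients move into the representable range.
        return loss * self.scale

    def unscale_and_update(self, grads):
        """Return unscaled gradients, or None if the step should be skipped."""
        if any(not math.isfinite(g) for g in grads):
            # Overflow: shrink the scale (not below the floor) and skip this update.
            self.scale = max(self.scale * self.backoff_factor, self.min_scale)
            self._good_steps = 0
            return None

        # Unscale with the scale that was used to produce these gradients.
        unscaled = [g / self.scale for g in grads]
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # A run of clean steps: try a larger scale next time.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return unscaled
```

In a training loop, the loss would pass through `scale_loss` before back-propagation, and the resulting gradients through `unscale_and_update` before the weight update. A `None` return means the optimizer step is skipped for that iteration.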


This review was generated by AI and reviewed by human editors.