[Paper Review] Mixed Precision Training With 8-bit Floating Point
The paper trains deep networks using 8-bit floating point (FP8) for weights, activations, errors, and gradients with a 32-bit accumulator, and reports state-of-the-art accuracy on ImageNet-1K and WMT16 across multiple models and tasks.
Reduced precision computation for deep neural networks is one of the key areas addressing the widening compute gap driven by an exponential growth in model size. In recent years, deep learning training has largely migrated to 16-bit precision, with significant gains in performance and energy efficiency. However, attempts to train DNNs at 8-bit precision have met with significant challenges because of the higher precision and dynamic range requirements of back-propagation. In this paper, we propose a method to train deep neural networks using 8-bit floating point representation for weights, activations, errors, and gradients. In addition to reducing compute precision, we also reduced the precision requirements for the master copy of weights from 32-bit to 16-bit. We demonstrate state-of-the-art accuracy across multiple data sets (ImageNet-1K, WMT16) and a broader set of workloads (ResNet-18/34/50, GNMT, Transformer) than previously reported. We propose an enhanced loss scaling method to augment the reduced subnormal range of 8-bit floating point for improved error propagation. We also examine the impact of quantization noise on generalization and propose a stochastic rounding technique to address gradient noise. As a result of applying all these techniques, we report slightly higher validation accuracy compared to the full-precision baseline.
Motivation & Objective
- Motivate reduced-precision training to address the growing compute gap in deep learning.
- Propose an FP8 compute scheme for weights, activations, errors, and gradients without stochastic rounding in the critical compute path.
- Demonstrate training with FP8 across large datasets and models while reducing master-copy weight precision.
- Address loss-scaling challenges and quantization noise to preserve or improve accuracy.
Proposed method
- Use FP8 with 1 sign bit, 5 exponent bits, and 2 mantissa bits (s=1, e=5, m=2) for weights, activations, errors, and gradients, with a 32-bit floating-point accumulator.
- Insert quantization operations in forward, backward, and weight-update paths to down-convert 32-bit outputs to FP8.
- Apply loss scaling to prevent gradient underflow and maintain optimization stability.
- Store master weights in FP16: the weight update itself is computed in FP32, and the result is converted back to FP16 before it is written to memory.
- Investigate rounding modes and introduce stochastic rounding to mitigate gradient noise and improve generalization.
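To make the quantization and loss-scaling steps above more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The function name `quantize_fp8_e5m2`, the IEEE-style exponent bias of 15, the subnormal handling, the saturation on overflow, and the loss-scale value of 1024 are all assumptions made for illustration. The sketch rounds a value to the nearest number representable with 5 exponent bits and 2 mantissa bits, then shows how a small gradient that underflows to zero survives once it is scaled up before quantization.

```python
import math

# Assumed IEEE-style layout for the (s=1, e=5, m=2) format: bias 15,
# subnormals supported, largest finite value 1.75 * 2**15.
MANTISSA_BITS = 2
MIN_NORMAL_EXP = -14            # exponent of the smallest normal number
MAX_FINITE = 1.75 * 2.0 ** 15   # 57344.0


def quantize_fp8_e5m2(x: float) -> float:
    """Round x to the nearest value on the assumed FP8 (1,5,2) grid, ties to even."""
    if x == 0.0 or math.isnan(x) or math.isinf(x):
        return x
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    # frexp returns mag = m * 2**e with 0.5 <= m < 1, so the binade exponent is e - 1.
    # Clamping to MIN_NORMAL_EXP makes all subnormals share the smallest spacing.
    _, e = math.frexp(mag)
    exp = max(e - 1, MIN_NORMAL_EXP)
    # Spacing between adjacent representable values in this binade.
    step = 2.0 ** (exp - MANTISSA_BITS)
    # Python's round() rounds exact ties to the even integer.
    q = round(mag / step) * step
    # Assumption: saturate at the largest finite value instead of overflowing.
    return sign * min(q, MAX_FINITE)


# Loss scaling: a small gradient underflows on its own, but survives when it is
# multiplied by a (hypothetical) loss scale before quantization.
grad = 5e-6
loss_scale = 1024.0
unscaled = quantize_fp8_e5m2(grad)                          # underflows to 0.0
rescaled = quantize_fp8_e5m2(grad * loss_scale) / loss_scale  # stays non-zero
```

With only two mantissa bits, each binade holds just four values, so the smallest non-zero magnitude on this assumed grid is 2**-16. That narrow subnormal range is why loss scaling carries more weight in FP8 than in FP16 training.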
Experimental results
Research questions
- RQ1: Can FP8 mixed-precision training achieve comparable or superior accuracy to FP32 baselines on convolutional architectures (ResNet variants) and NLP/Seq2Seq models?
- RQ2: What loss-scaling strategies and rounding methods best cope with the reduced subnormal range of FP8 during training?
- RQ3: How does FP8 impact convergence, generalization, and memory efficiency across large datasets like ImageNet-1K and WMT16?
Key findings
- FP8 training with enhanced loss scaling reaches top-1 accuracy that is on par with or slightly above the FP32 baselines for ResNet-18/34/50 on ImageNet-1K (FP8 vs FP32: 69.71 vs 69.23; 72.95 vs 72.96; 75.70 vs 75.47).
- FP8 training with FP32 accumulators maintains stable convergence and accuracy across ResNet workloads and GNMT/Transformer translation tasks on WMT16.
- WMT16 BLEU scores with FP8 are comparable to the FP32 baselines (FP8 vs FP32: GNMT 24.6 vs 24.3; Transformer 23.0 vs 23.6), slightly higher for GNMT and slightly lower for Transformer.
- Dynamic loss-scaling strategies with FP8 are needed for some models (e.g., GNMT) to prevent divergence and improve generalization.
- Stochastic rounding of activations/gradients can improve generalization and lead to slightly better validation performance compared to deterministic rounding.
- Combining FP8 compute with FP16 master copies halves master-weight storage relative to FP32 master copies without degrading accuracy.
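The stochastic rounding finding can be illustrated with a short sketch in plain Python. It is not the exact variant used in the paper. The function name `stochastic_round_fp8_e5m2`, the IEEE-style bias of 15, and the saturation behavior are assumptions for illustration. The idea shown is the standard one: a value is rounded up or down to a neighboring representable number with probability proportional to its distance from each, so the rounding error is zero on average rather than systematically biased.

```python
import math
import random

# Same assumed (s=1, e=5, m=2) grid as an IEEE-style format with bias 15.
MANTISSA_BITS = 2
MIN_NORMAL_EXP = -14
MAX_FINITE = 1.75 * 2.0 ** 15


def stochastic_round_fp8_e5m2(x: float, rng: random.Random) -> float:
    """Round x to one of its two neighbors on the assumed FP8 grid, at random.

    The probability of rounding up equals the fractional distance from the
    lower neighbor, which makes the result unbiased in expectation.
    """
    if x == 0.0 or math.isnan(x) or math.isinf(x):
        return x
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    _, e = math.frexp(mag)                 # mag = m * 2**e, 0.5 <= m < 1
    exp = max(e - 1, MIN_NORMAL_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)    # grid spacing in this binade
    scaled = mag / step
    lower = math.floor(scaled)
    frac = scaled - lower                  # in [0, 1)
    q = lower + (1 if rng.random() < frac else 0)
    return sign * min(q * step, MAX_FINITE)


# Usage: 1.1 lies between the grid points 1.0 and 1.25. Round-to-nearest would
# always return 1.0, while the stochastic version averages out close to 1.1.
rng = random.Random(0)
samples = [stochastic_round_fp8_e5m2(1.1, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
```

Each individual sample is noisier than a deterministic rounding, but small contributions are no longer always rounded away, which matters when many small values are accumulated over training.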
This review was created by AI and reviewed by human editors.