QUICK REVIEW

[Paper Review] A Study of BFLOAT16 for Deep Learning Training

Dhiraj Kalamkar, Dheevatsa Mudigere|arXiv (Cornell University)|May 29, 2019

Tensor decomposition and applications | 34 references | 67 citations

TL;DR

The paper empirically validates BFLOAT16 as a robust half-precision format for DL training, matching FP32 results across diverse tasks without hyperparameter changes.

ABSTRACT

This paper presents the first comprehensive empirical study demonstrating the efficacy of the Brain Floating Point (BFLOAT16) half-precision format for Deep Learning training across image classification, speech recognition, language modeling, generative networks and industrial recommendation systems. BFLOAT16 is attractive for Deep Learning training for two reasons: the range of values it can represent is the same as that of IEEE 754 floating-point format (FP32) and conversion to/from FP32 is simple. Maintaining the same range as FP32 is important to ensure that no hyper-parameter tuning is required for convergence; e.g., IEEE 754 compliant half-precision floating point (FP16) requires hyper-parameter tuning. In this paper, we discuss the flow of tensors and various key operations in mixed precision training, and delve into details of operations, such as the rounding modes for converting FP32 tensors to BFLOAT16. We have implemented a method to emulate BFLOAT16 operations in Tensorflow, Caffe2, IntelCaffe, and Neon for our experiments. Our results show that deep learning training using BFLOAT16 tensors achieves the same state-of-the-art (SOTA) results across domains as FP32 tensors in the same number of iterations and with no changes to hyper-parameters.
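As a rough illustration of the range argument in the abstract, the sketch below derives the largest finite value and the smallest normal value of each format from its bit widths. The bit layouts (FP32: 8 exponent and 23 mantissa bits; BFLOAT16: 8 and 7; FP16: 5 and 10) are the standard ones. The helper names are illustrative and do not come from the paper.

```python
# Illustrative helpers; names are assumptions, not taken from the paper.
# Each entry maps a format name to (exponent bits, explicit mantissa bits).
FORMATS = {
    "FP32": (8, 23),
    "BFLOAT16": (8, 7),
    "FP16": (5, 10),
}

def max_finite(exp_bits, man_bits):
    """Largest finite value of an IEEE-style binary format."""
    bias = 2 ** (exp_bits - 1) - 1
    # The all-ones exponent is reserved for Inf/NaN, so stop one below it.
    max_exp = (2 ** exp_bits - 2) - bias
    return (2 - 2.0 ** -man_bits) * 2.0 ** max_exp

def min_normal(exp_bits):
    """Smallest positive normal value of an IEEE-style binary format."""
    bias = 2 ** (exp_bits - 1) - 1
    return 2.0 ** (1 - bias)

for name, (e, m) in FORMATS.items():
    print(f"{name:9s} max={max_finite(e, m):.4e}  min_normal={min_normal(e):.4e}")
```

Because BFLOAT16 keeps the 8-bit exponent of FP32, its smallest normal value is identical to that of FP32 and its largest finite value is within about 0.4% of it. FP16, with 5 exponent bits, tops out at 65504.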

Motivation & Objective

Evaluate BFLOAT16 efficacy for deep learning training across multiple domains
Analyze the flow of tensors and core operations in mixed-precision training with BFLOAT16
Demonstrate emulation of BFLOAT16 in popular frameworks and compare against FP32 baselines
Assess need for loss scaling and hyperparameter tuning when using BFLOAT16
Explore practical implications for hardware and software stacks in BFLOAT16 training

Proposed method

Emulate BFLOAT16 by rounding FP32 operands with Round-to-Nearest-Even (RNE) and zeroing their lower 16 bits
Develop Quantlib to modify FP32 tensors to BFLOAT16 format for forward, backward, and non-GEMM operations
Use FP32 accumulators with BFLOAT16 inputs and maintain FP32 weight updates for accuracy
Evaluate across AlexNet, ResNet-50, DC-GAN, SR-GAN, DeepSpeech2, GNMT, and two industrial workloads
Compare BFLOAT16 against FP16 and INT16 in terms of accuracy and required hyperparameter tuning
Implement BFLOAT16 emulation in Tensorflow, Caffe2, IntelCaffe, and Neon for experiments
Demonstrate near-identical training trajectories to FP32 with no hyperparameter changes
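The emulation steps listed above can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the paper's Quantlib: the function names are hypothetical, values are processed one scalar at a time, and FP32 accumulation is imitated by rounding a Python double back to FP32 after every addition.

```python
import struct

def fp32_bits(x):
    """Bit pattern of x after rounding it to FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_fp32(b):
    """FP32 value for a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def to_bf16_truncate(x):
    """Direct truncation: simply zero the lower 16 bits."""
    return bits_fp32(fp32_bits(x) & 0xFFFF0000)

def to_bf16_rne(x):
    """Round-to-Nearest-Even, then zero the lower 16 bits."""
    b = fp32_bits(x)
    if (b & 0x7F800000) == 0x7F800000:  # Inf or NaN: do not round
        if b & 0x007FFFFF:              # NaN: keep a non-zero mantissa bit
            return bits_fp32((b & 0xFFFF0000) | 0x00400000)
        return bits_fp32(b)
    lsb = (b >> 16) & 1                 # lowest bit that survives
    b += 0x7FFF + lsb                   # ties go to the even neighbour
    return bits_fp32(b & 0xFFFF0000)

def to_fp32(x):
    """Round a Python double to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot_bf16_inputs_fp32_acc(a, b):
    """Dot product with BFLOAT16-rounded inputs and an emulated FP32 accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_bf16_rne(x) * to_bf16_rne(y)
        acc = to_fp32(acc + prod)       # keep the running sum in FP32
    return acc

x = 1.01171875                          # 1 + 2**-7 + 2**-8, an exact tie
print(to_bf16_truncate(x), to_bf16_rne(x))
print(dot_bf16_inputs_fp32_acc([1.0, 2.0], [3.0, 4.0]))
```

The tie case shows the difference between the two conversions: truncation always moves toward zero, while RNE picks the neighbour whose last kept bit is even. Weight updates in FP32 would follow the same pattern, with `to_bf16_rne` applied only to the copy of the weights that feeds the next forward pass.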

Experimental results

Research questions

RQ1: Can BFLOAT16 training achieve equivalent accuracy to FP32 across vision, speech, language, GANs, and recommender systems?
RQ2: Does BFLOAT16 eliminate the need for loss scaling and hyperparameter tuning typical of FP16 mixed precision?
RQ3: How does BFLOAT16 performance and accuracy compare to FP16 and INT16 across diverse workloads?
RQ4: What is the impact of BFLOAT16 on training flow, including GEMM and non-GEMM operations, in standard frameworks?
RQ5: Are there practical hardware and software implications for adopting BFLOAT16 in large-scale training pipelines?

Key findings

BFLOAT16 achieves the same state-of-the-art results as FP32 in the same number of iterations across multiple domains.
AlexNet and ResNet-50 trained with BFLOAT16 emulation reach similar top-1/top-5 accuracies as FP32 baselines.
GNMT BLEU scores under BFLOAT16 match or exceed FP32 baselines for evaluated translation tasks.
GANs (DC-GAN, SR-GAN) trained with BFLOAT16 yield comparable inception scores and SSIM metrics to FP32.
Industrial workloads for recommendation systems show negligible loss when using BFLOAT16 with appropriate rounding; direct truncation can introduce small degradation.
BFLOAT16 training avoids hyperparameter tuning and complex software management required by FP16 and INT16 approaches.
Advanced emulation on AVX512BF16 shows BFLOAT16 can deliver state-of-the-art results on ResNet-50 with FP32 accumulation.
In hardware-path experiments, BFLOAT16-based training is feasible with minimal software changes, aligning with future Xeon CPU capabilities.
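To make the loss-scaling point in the findings concrete, the sketch below pushes one small value through FP16 and through BFLOAT16. The gradient value and the scale factor are hypothetical numbers chosen for illustration, not figures from the paper, and BFLOAT16 is emulated by plain truncation to keep the example short.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE FP16 and back (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16_truncate(x):
    """Emulate BFLOAT16 by zeroing the lower 16 bits of the FP32 pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

grad = 1e-8          # hypothetical small gradient value
loss_scale = 1024.0  # hypothetical loss-scale factor

# FP16 without scaling: the value is below the smallest FP16 subnormal.
fp16_plain = to_fp16(grad)

# FP16 with loss scaling: scale up, store in FP16, unscale in higher precision.
fp16_scaled = to_fp16(grad * loss_scale) / loss_scale

# BFLOAT16 without scaling: the FP32-sized exponent keeps the value alive.
bf16_plain = to_bf16_truncate(grad)

print("FP16, no scaling:    ", fp16_plain)
print("FP16, loss scaling:  ", fp16_scaled)
print("BFLOAT16, no scaling:", bf16_plain)
```

In this example the unscaled FP16 value flushes to zero, while the BFLOAT16 value survives with a small relative error. That is the mechanism behind the finding that BFLOAT16 avoids the scale factor FP16 training has to tune.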

This review was created by AI and reviewed by human editors.