QUICK REVIEW

[Paper Review] A Study of BFLOAT16 for Deep Learning Training

Dhiraj Kalamkar, Dheevatsa Mudigere | arXiv (Cornell University) | 2019-05-29

Tensor decomposition and applications | References: 34 | Citations: 67

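One-line summary

This paper empirically validates BFLOAT16 as a robust half-precision format for deep learning training, matching FP32 results across a variety of tasks without any hyper-parameter changes.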

ABSTRACT

This paper presents the first comprehensive empirical study demonstrating the efficacy of the Brain Floating Point (BFLOAT16) half-precision format for Deep Learning training across image classification, speech recognition, language modeling, generative networks and industrial recommendation systems. BFLOAT16 is attractive for Deep Learning training for two reasons: the range of values it can represent is the same as that of IEEE 754 floating-point format (FP32) and conversion to/from FP32 is simple. Maintaining the same range as FP32 is important to ensure that no hyper-parameter tuning is required for convergence; e.g., IEEE 754 compliant half-precision floating point (FP16) requires hyper-parameter tuning. In this paper, we discuss the flow of tensors and various key operations in mixed precision training, and delve into details of operations, such as the rounding modes for converting FP32 tensors to BFLOAT16. We have implemented a method to emulate BFLOAT16 operations in Tensorflow, Caffe2, IntelCaffe, and Neon for our experiments. Our results show that deep learning training using BFLOAT16 tensors achieves the same state-of-the-art (SOTA) results across domains as FP32 tensors in the same number of iterations and with no changes to hyper-parameters.
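The two properties the abstract highlights (FP32-like range and simple conversion) can be checked directly from the bit layouts. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical and chosen for illustration, not taken from the paper. It assumes the standard layouts: FP32 with 8 exponent and 23 mantissa bits, BFLOAT16 with 8 exponent and 7 mantissa bits, and IEEE FP16 with 5 exponent and 10 mantissa bits.

```python
import struct

def bits_to_fp32(bits: int) -> float:
    """Interpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

def fp32_to_bits(x: float) -> int:
    """Return the 32-bit pattern of x after rounding it to FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bf16_bits_to_fp32(bf16_bits: int) -> float:
    """Widen a BFLOAT16 pattern to FP32 by appending 16 zero bits."""
    return bits_to_fp32((bf16_bits & 0xFFFF) << 16)

def fp16_bits_to_float(fp16_bits: int) -> float:
    """Interpret a 16-bit integer as an IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<H", fp16_bits & 0xFFFF))[0]

# Largest finite value of each format: largest non-special exponent,
# all-ones mantissa.
FP32_MAX = bits_to_fp32(0x7F7FFFFF)
BF16_MAX = bf16_bits_to_fp32(0x7F7F)
FP16_MAX = fp16_bits_to_float(0x7BFF)

print(f"FP32 max: {FP32_MAX:.4e}")
print(f"BF16 max: {BF16_MAX:.4e}")
print(f"FP16 max: {FP16_MAX:.4e}")

# BF16 -> FP32 is a pure bit shift, so the widening is always exact.
upper_half = fp32_to_bits(1.5) >> 16
print(hex(upper_half), bf16_bits_to_fp32(upper_half))
```

BFLOAT16 shares the 8-bit exponent of FP32, so its largest finite value is within about 0.4% of FP32's, whereas FP16 tops out at 65504. The price is precision: 7 mantissa bits instead of FP16's 10.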

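Motivation and objectives

Evaluate the efficacy of BFLOAT16 for deep learning training across a range of domains
Analyze the flow of tensors and the key operations in mixed-precision training with BFLOAT16
Demonstrate BFLOAT16 emulation in popular frameworks and compare it against FP32 baselines
Assess whether loss scaling and hyper-parameter tuning are needed when using BFLOAT16
Explore the practical implications of BFLOAT16 training for the hardware and software stack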

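Proposed method

Emulate BFLOAT16 by rounding FP32 operands with Round-to-Nearest-Even (RNE) and zeroing their lower 16 bits
Develop Quantlib, which converts FP32 tensors to BFLOAT16 format for the forward pass, the backward pass, and non-GEMM operations
Use FP32 accumulators with BFLOAT16 inputs for accuracy, and keep weight updates in FP32
Evaluate across AlexNet, ResNet-50, DC-GAN, SR-GAN, DeepSpeech2, GNMT, and two industrial workloads
Compare BFLOAT16 with FP16 and INT16 in terms of accuracy and required hyper-parameter tuning
Implement BFLOAT16 emulation in Tensorflow, Caffe2, IntelCaffe, and Neon for the experiments
Demonstrate training trajectories nearly identical to FP32 with no hyper-parameter changes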
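The emulation recipe in the list above (RNE rounding into the upper 16 bits, FP32 accumulation, FP32 weight updates) can be sketched in a few lines. This is an illustrative stand-in written against the Python standard library, not the paper's Quantlib code; every function name here is hypothetical, and passing Inf/NaN through unchanged is a simplifying assumption of this sketch.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (a double) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16_rne(x: float) -> float:
    """Emulate BFLOAT16: round FP32 to 16 bits with round-to-nearest-even,
    returning the result in an FP32 container (lower 16 bits zero)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits & 0x7F800000) == 0x7F800000:
        return x  # Inf/NaN: passed through unchanged (assumption of this sketch)
    lsb = (bits >> 16) & 1                     # lowest bit that will be kept
    bits = (bits + 0x7FFF + lsb) & 0xFFFF0000  # ties go to the even neighbor
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def dot_bf16_fp32_acc(a, b):
    """Dot product with BFLOAT16 inputs and an FP32 accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        # The product of two BFLOAT16 values fits exactly in FP32;
        # only the running sum is rounded, and it is rounded to FP32.
        acc = to_fp32(acc + to_bf16_rne(x) * to_bf16_rne(y))
    return acc

def sgd_step_fp32_master(w_fp32: float, grad: float, lr: float) -> float:
    """Apply the weight update to an FP32 master copy of the weight."""
    return to_fp32(w_fp32 - lr * grad)

w = sgd_step_fp32_master(1.0, grad=1.0, lr=1e-3)
print(w)               # the small update survives in FP32
print(to_bf16_rne(w))  # the same value rounds back to 1.0 in BFLOAT16
```

The last two lines hint at why the weight update is kept in FP32: near 1.0 the BFLOAT16 grid spacing is a few thousandths, so an update of 0.001 applied directly to a BFLOAT16 weight would be rounded away.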

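Experimental results

Research questions

RQ1: Can BFLOAT16 training match FP32 accuracy across vision, speech, language, GANs, and recommendation systems?
RQ2: Does BFLOAT16 remove the need for the loss scaling and hyper-parameter tuning typically required by FP16 mixed precision?
RQ3: How do the performance and accuracy of BFLOAT16 compare with FP16 and INT16 across workloads?
RQ4: What is the impact of BFLOAT16 on the training flow in standard frameworks, including GEMM and non-GEMM operations?
RQ5: Are there practical hardware and software implications for adopting BFLOAT16 in large-scale training pipelines?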
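As background for RQ2, the sketch below illustrates why FP16 mixed precision usually relies on loss scaling while BFLOAT16 may not need it: a small gradient value underflows to zero in FP16 but remains representable in BFLOAT16. The gradient value 1e-8 and the scale factor 65536 are arbitrary choices for illustration, and the helper names are hypothetical rather than taken from the paper.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a value to IEEE 754 half precision (FP16)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16_rne(x: float) -> float:
    """Emulate BFLOAT16 by round-to-nearest-even on the upper 16 FP32 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits & 0x7F800000) == 0x7F800000:
        return x  # Inf/NaN: passed through unchanged (assumption of this sketch)
    lsb = (bits >> 16) & 1
    bits = (bits + 0x7FFF + lsb) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

grad = 1e-8  # illustrative small gradient value

# FP16: below half of the smallest subnormal (2**-24), so it rounds to zero.
print(to_fp16(grad))

# BFLOAT16: the 8-bit exponent keeps the value, with reduced precision.
print(to_bf16_rne(grad))

# Loss scaling for FP16: scale up before the cast, scale back down afterwards.
scale = 65536.0
recovered = to_fp16(grad * scale) / scale
print(recovered)
```

The scaling factor is one more quantity that has to be chosen and managed per model, which is the kind of overhead RQ2 asks whether BFLOAT16 avoids.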

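Key findings

BFLOAT16 achieves the same state-of-the-art results as FP32 across multiple domains in the same number of iterations.
AlexNet and ResNet-50 trained with BFLOAT16 emulation reach top-1/top-5 accuracy similar to the FP32 baselines.
GNMT BLEU scores under BFLOAT16 match or exceed the FP32 baseline.
GANs (DC-GAN, SR-GAN) trained with BFLOAT16 yield inception scores and SSIM metrics comparable to FP32.
The industrial recommendation-system workloads show almost no loss with BFLOAT16 when proper rounding is used; direct truncation can cause slight degradation.
BFLOAT16 training works without the hyper-parameter tuning and complex software management that FP16 and INT16 approaches require.
Advanced emulation of AVX512BF16 with FP32 accumulation shows that state-of-the-art results are attainable on ResNet-50.
In hardware-path experiments, BFLOAT16-based training is feasible with minimal software changes and is consistent with the features of upcoming Xeon CPUs.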
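The difference between proper rounding and direct truncation noted in the findings above can be made concrete by measuring the average conversion error of each scheme. The sketch below sweeps a grid of FP32 values in [1, 2); the grid and the helper names are assumptions made for illustration, and the rounding helper handles finite inputs only. It says nothing about the size of the effect on any real workload.

```python
import struct

def fp32_to_bits(x: float) -> int:
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_fp32(bits: int) -> float:
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def bf16_truncate(x: float) -> float:
    """Direct truncation: simply drop the lower 16 bits of the FP32 pattern."""
    return bits_to_fp32(fp32_to_bits(x) & 0xFFFF0000)

def bf16_rne(x: float) -> float:
    """Round-to-nearest-even into the upper 16 bits (finite inputs only)."""
    bits = fp32_to_bits(x)
    lsb = (bits >> 16) & 1
    return bits_to_fp32((bits + 0x7FFF + lsb) & 0xFFFF0000)

# Evenly spaced FP32 values in [1, 2); each is exactly representable in FP32.
xs = [1.0 + k * 2.0**-12 for k in range(2**12)]

mean_trunc_err = sum(bf16_truncate(x) - x for x in xs) / len(xs)
mean_rne_err = sum(bf16_rne(x) - x for x in xs) / len(xs)

# For positive inputs truncation never rounds up, so its error is one-sided.
print(f"mean truncation error: {mean_trunc_err:+.6f}")
print(f"mean RNE error:        {mean_rne_err:+.6f}")
```

Truncation always pulls positive values toward zero, so its errors accumulate in one direction, while round-to-nearest-even errors cancel on average over this grid. A systematic bias of that kind is one plausible reason truncation could cause the slight degradation reported for the industrial workloads.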

This review was generated by AI and reviewed by a human editor.