QUICK REVIEW

[Paper Review] Mesh-TensorFlow: Deep Learning for Supercomputers

Noam Shazeer, Youlong Cheng | arXiv (Cornell University) | November 5, 2018

Advanced Neural Network Applications | References: 21 | Citations: 52

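One-Line Summary

Mesh-TensorFlow introduces a language for specifying distributed tensor computations on a multi-dimensional mesh of processors, enabling scalable model-parallel and data-parallel training of large models (e.g., Transformers) on TPUs and achieving state-of-the-art results.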

ABSTRACT

Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and to implement, particularly on large clusters. We introduce Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations. Where data-parallelism can be viewed as splitting tensors and operations along the "batch" dimension, in Mesh-TensorFlow, the user can specify any tensor-dimensions to be split across any dimensions of a multi-dimensional mesh of processors. A Mesh-TensorFlow graph compiles into a SPMD program consisting of parallel operations coupled with collective communication primitives such as Allreduce. We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model. Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing state of the art results on WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark. Mesh-Tensorflow is available at https://github.com/tensorflow/mesh .

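Research Motivation and Goals

Motivate scalable training beyond pure data parallelism in order to address memory bottlenecks and latency in large DNNs.
Introduce Mesh-TensorFlow as a language for specifying distributed tensor computations across a multi-dimensional mesh of processors.
Show how a Mesh-TensorFlow graph compiles into an SPMD program with collective communication.
Demonstrate the practical benefits by training Transformer models with billions of parameters on TPU clusters.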

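Proposed Method

Define named tensor dimensions and a multi-dimensional mesh of processors.
Specify a global computation layout that maps tensor dimensions to mesh dimensions.
Represent each tensor as one slice per processor, and implement each operation as local computation plus, where needed, collective communication (Allreduce).
Use einsum-style operations (Einsum) and reduction operations to express matrix multiplications and contractions across distributed shards.
Provide data-parallel, model-parallel, and mixed layouts, and analyze their performance trade-offs in terms of computation, communication, and memory.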
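
The slicing and Allreduce steps above can be illustrated with a toy simulation. The sketch below is not Mesh-TensorFlow code: it uses plain Python lists, treats each "processor" as a list entry, and models Allreduce as an elementwise sum. The helper names (matmul, split_rows, split_cols, allreduce_sum) and the matrix values are made up for illustration. It shows that splitting the batch dimension needs no communication in the forward pass, while splitting the contracted dimension leaves each processor with a partial sum that must be combined.

```python
# Toy simulation, not the Mesh-TensorFlow API: "processors" are list entries
# and Allreduce is modeled as an elementwise sum over all partial results.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_rows(m, n):
    """Split a matrix into n equal groups of rows."""
    size = len(m) // n
    return [m[i * size:(i + 1) * size] for i in range(n)]

def split_cols(m, n):
    """Split a matrix into n equal groups of columns."""
    size = len(m[0]) // n
    return [[row[i * size:(i + 1) * size] for row in m] for i in range(n)]

def allreduce_sum(partials):
    """Elementwise sum of same-shaped matrices (every processor would get the result)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

# Illustrative values: x has shape [batch=2, hidden=4], w has shape [hidden=4, out=2].
x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0], [0, 1], [1, 1], [2, -1]]
reference = matmul(x, w)  # unsplit result for comparison

n = 2  # number of simulated processors

# Data-parallel layout: split "batch"; w is replicated on every processor.
# Each processor computes its own rows of the output independently.
data_parallel = [row for xs in split_rows(x, n) for row in matmul(xs, w)]

# Model-parallel layout: split the contracted "hidden" dimension of both operands.
# Each processor holds a partial sum; Allreduce combines them.
partials = [matmul(xs, ws) for xs, ws in zip(split_cols(x, n), split_rows(w, n))]
model_parallel = allreduce_sum(partials)

print(data_parallel == reference, model_parallel == reference)
```

Both layouts reproduce the unsplit product; they differ in what each processor must store and in whether a collective is needed.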

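Experimental Results

Research Questions

RQ1: Can Mesh-TensorFlow express and efficiently execute a broad range of distributed tensor computations beyond data parallelism?
RQ2: How do different distribution layouts (data-parallel, model-parallel, hybrid) affect communication, memory, and scalability on large TPU meshes?
RQ3: What performance and model-size benefits can be obtained by applying Mesh-TensorFlow to Transformer-like architectures on large clusters?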

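Key Results

A Mesh-TensorFlow graph compiles into an SPMD program consisting of parallel operations coupled with MPI-like collective communication primitives such as Allreduce.
Data-parallel, model-parallel, and hybrid layouts enable training of Transformer models with up to billions of parameters on TPU meshes.
Transformer models with up to 5 billion parameters were trained on up to 512 cores, achieving state-of-the-art results on WMT'14 English-to-French translation and the One Billion Word language modeling benchmark.
With a multi-dimensional mesh (e.g., a 2D mesh over 512 TPU cores), computational efficiency stays high (above 50% of peak) while model size and the number of attention heads are scaled up.
The method combines data parallelism and model parallelism so that batch size and model dimensions can be scaled in proportion to the number of processors.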
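
To make the last point concrete, the sketch below computes the per-processor slice shape of a tensor under a layout on a 2D mesh. It is a hypothetical helper, not part of the Mesh-TensorFlow API. The mesh shape (16 x 32), the dimension names, and the tensor sizes are illustrative assumptions, not figures from the paper. It also assumes that every split dimension divides evenly and that one tensor may not split two of its dimensions across the same mesh dimension.

```python
from math import prod

def slice_shape(tensor_dims, mesh_shape, layout):
    """Per-processor slice shape of a tensor with named dimensions.

    tensor_dims: dict of tensor dimension name -> size
    mesh_shape:  dict of mesh dimension name -> number of processors
    layout:      dict of tensor dimension name -> mesh dimension name
    Hypothetical helper for illustration only.
    """
    used = set()
    shape = {}
    for name, size in tensor_dims.items():
        mesh_dim = layout.get(name)
        if mesh_dim is None:
            shape[name] = size  # not in the layout: replicated, full size everywhere
            continue
        if mesh_dim in used:
            raise ValueError(f"mesh dimension {mesh_dim!r} used twice by one tensor")
        parts = mesh_shape[mesh_dim]
        if size % parts != 0:
            raise ValueError(f"{name}={size} is not divisible by {parts}")
        used.add(mesh_dim)
        shape[name] = size // parts
    return shape

# Assumed 2D mesh with 16 * 32 = 512 processors.
mesh = {"rows": 16, "cols": 32}
# Hybrid layout: batch is split across rows (data parallelism),
# the feed-forward hidden dimension across cols (model parallelism).
layout = {"batch": "rows", "d_ff": "cols"}

weights = slice_shape({"d_model": 1024, "d_ff": 65536}, mesh, layout)
activations = slice_shape({"batch": 512, "d_model": 1024}, mesh, layout)
hidden = slice_shape({"batch": 512, "d_ff": 65536}, mesh, layout)

print(prod(mesh.values()), weights, activations, hidden)
```

Under these assumptions, each processor stores only 1/32 of the weight matrix and 1/16 of the batch. Doubling a mesh dimension lets the matching tensor dimension double while the per-processor slice stays the same size.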

This review was generated by AI and reviewed by a human editor.