QUICK REVIEW

[Paper Review] Mesh-TensorFlow: Deep Learning for Supercomputers

Noam Shazeer, Youlong Cheng | arXiv (Cornell University) | Nov 5, 2018

Advanced Neural Network Applications | References: 21 | Cited by: 52

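One-Sentence Summary

Mesh-TensorFlow introduces a language for specifying distributed tensor computations across a multi-dimensional mesh of processors, enabling scalable model-parallel and data-parallel training of large models such as Transformers on TPUs and achieving state-of-the-art results.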

ABSTRACT

Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and to implement, particularly on large clusters. We introduce Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations. Where data-parallelism can be viewed as splitting tensors and operations along the "batch" dimension, in Mesh-TensorFlow, the user can specify any tensor-dimensions to be split across any dimensions of a multi-dimensional mesh of processors. A Mesh-TensorFlow graph compiles into a SPMD program consisting of parallel operations coupled with collective communication primitives such as Allreduce. We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model. Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing state of the art results on WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark. Mesh-Tensorflow is available at https://github.com/tensorflow/mesh .

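Research Motivation and Goals

Motivate scalable training beyond pure data parallelism, to address the memory bottlenecks and latency problems of large deep neural networks.
Introduce Mesh-TensorFlow as a language for specifying distributed tensor computations on a multi-dimensional mesh of processors.
Show how a Mesh-TensorFlow graph compiles into an SPMD program with collective communication.
Demonstrate practical benefits by training Transformer models with billions of parameters on TPU clusters.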

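Proposed Method

Define named tensor dimensions and a multi-dimensional mesh of processors.
Specify a global computation layout that maps tensor dimensions to mesh dimensions.
Represent each tensor as one slice per processor, and implement operations as local computation plus collective communication (Allreduce) where needed.
Use Einstein-summation-style operations (Einsum) and reductions to express matrix multiplications and contractions over the distributed slices.
Provide data-parallel, model-parallel, and hybrid layouts, and analyze their performance trade-offs in terms of computation, communication, and memory.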
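
The layout idea above (named tensor dimensions mapped onto mesh dimensions, with each processor holding one slice) can be sketched in plain Python. This is only an illustration under stated assumptions, not the Mesh-TensorFlow API: the function `slice_shape`, the mesh-dimension names `rows` and `cols`, and the tensor sizes are all hypothetical. The sketch also assumes that a layout may not split two dimensions of the same tensor over the same mesh dimension, and that split dimensions divide evenly.

```python
# Hypothetical illustration of a layout; NOT the Mesh-TensorFlow API.

def slice_shape(tensor_shape, mesh_shape, layout):
    """Return the per-processor slice shape of a tensor.

    tensor_shape: dict, tensor-dimension name -> size
    mesh_shape:   dict, mesh-dimension name -> number of processors
    layout:       dict, tensor-dimension name -> mesh-dimension name
    Tensor dimensions absent from the layout are replicated on every processor.
    """
    used_mesh_dims = set()
    result = {}
    for name, size in tensor_shape.items():
        mesh_dim = layout.get(name)
        if mesh_dim is None:
            result[name] = size  # not split: each processor holds the full extent
            continue
        if mesh_dim in used_mesh_dims:
            # Assumption of this sketch: one mesh dimension can split
            # at most one dimension of a given tensor.
            raise ValueError(f"mesh dimension {mesh_dim!r} used twice")
        used_mesh_dims.add(mesh_dim)
        n = mesh_shape[mesh_dim]
        if size % n != 0:
            raise ValueError(f"{name}={size} is not divisible by {n}")
        result[name] = size // n
    return result


# A 4 x 2 mesh: "batch" is split over rows (data parallelism),
# "hidden" is split over cols (model parallelism).
mesh = {"rows": 4, "cols": 2}
layout = {"batch": "rows", "hidden": "cols"}

activations = {"batch": 64, "d_model": 512}
weights = {"d_model": 512, "hidden": 2048}

act_slice = slice_shape(activations, mesh, layout)
w_slice = slice_shape(weights, mesh, layout)
print(act_slice)  # {'batch': 16, 'd_model': 512}
print(w_slice)    # {'d_model': 512, 'hidden': 1024}
```

In this hybrid layout the activations shrink only along `batch` and the weights only along `hidden`, which is why per-processor memory for both can fall as the mesh grows in both directions.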

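Experimental Results

Research Questions

RQ1: Can Mesh-TensorFlow express and efficiently execute general distributed tensor computations beyond data parallelism?
RQ2: How do different distribution layouts (data-parallel, model-parallel, and hybrid) affect communication, memory, and scalability on large TPU meshes?
RQ3: What gains in performance and model size can be achieved by applying Mesh-TensorFlow to Transformer-like architectures on large clusters?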

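Key Findings

A Mesh-TensorFlow graph compiles into an SPMD program consisting of parallel operations and MPI-style collective communication.
Data-parallel, model-parallel, and hybrid layouts make it possible to train Transformer models with billions of parameters on TPU meshes.
Transformer models with up to 5 billion parameters, trained on up to 512 cores, achieved state-of-the-art results on WMT'14 English-to-French translation and the One Billion Word language modeling benchmark.
Using a multi-dimensional mesh (for example, a 2D mesh of 512 TPU cores) kept computational efficiency fairly high (above 50% of peak) while scaling up model size and the number of attention heads.
The approach allows data parallelism and model parallelism to be combined, so that batch size and model dimensions can scale in proportion to the number of processors.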
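
The first finding, that a graph becomes local operations plus Allreduce-style collectives, can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not code from the paper or the library: the "processors" are just loop iterations, and `matmul`, `allreduce_sum`, and `split_contraction_matmul` are hypothetical helper names. It splits the contracted dimension of a matrix product across processors, lets each compute a partial product from its own slices, and then sums the partial results so that every processor ends up with the full answer.

```python
# Toy single-process simulation; processors are simulated by a loop.

def matmul(a, b):
    """Plain nested-list matrix product: a is [m][k], b is [k][n]."""
    return [
        [sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def allreduce_sum(partials):
    """Simulated Allreduce: every processor receives the elementwise sum."""
    rows, cols = len(partials[0]), len(partials[0][0])
    total = [
        [sum(p[i][j] for p in partials) for j in range(cols)]
        for i in range(rows)
    ]
    # Give each simulated processor its own copy of the reduced result.
    return [[row[:] for row in total] for _ in partials]


def split_contraction_matmul(x, w, n_procs):
    """Compute x @ w with the contracted dimension split over n_procs."""
    k = len(w)
    if k % n_procs != 0:
        raise ValueError("contracted dimension must divide evenly")
    step = k // n_procs
    partials = []
    for rank in range(n_procs):
        lo, hi = rank * step, (rank + 1) * step
        x_slice = [row[lo:hi] for row in x]  # this processor's columns of x
        w_slice = w[lo:hi]                   # this processor's rows of w
        partials.append(matmul(x_slice, w_slice))  # purely local computation
    return allreduce_sum(partials)


x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0], [0, 1], [1, 1], [2, -1]]

reference = matmul(x, w)
results = split_contraction_matmul(x, w, n_procs=2)
print(reference)   # [[12, 1], [28, 5]]
print(results[0])  # [[12, 1], [28, 5]]
```

When the split dimension is not contracted (for example, the batch dimension in pure data parallelism), the local products are already the correct output slices and no reduction is needed for the forward computation; the communication cost appears only when a split dimension is summed away.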


This review was generated by AI and reviewed by human editors.