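[Paper Explained] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
GShard introduces a general annotation-based approach, together with an XLA-based SPMD compiler, for training giant Transformer models with Sparsely-Gated MoE layers. Compute grows sublinearly with model size, and a 600B-parameter multilingual translation model trains in 4 days on 2048 TPUs.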
Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
Motivation and Objectives
- Scaling neural networks improves model quality, but it raises practical challenges in computation cost, ease of programming, and efficient deployment on parallel devices. The paper aims to address all three.
Proposed Method
- Extend Transformer with Position-wise Sparsely-Gated Mixture-of-Experts (MoE) layers to achieve sublinear computation scaling.
- Introduce GShard, a module of lightweight annotation APIs plus an XLA compiler extension for automatic parallelization.
- Adopt an SPMD (Single Program Multiple Data) partitioning strategy to keep compilation time constant regardless of device count.
- Provide a design where model developers write as if on a single huge device, with automatic partitioning applied by the compiler.
- Use a gating mechanism with expert capacity constraints and an auxiliary loss to balance load across thousands of experts.
- Demonstrate end-to-end training and scaling on multilingual machine translation from 100 languages to English.
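The single-program idea behind the SPMD strategy listed above can be illustrated with a toy sketch in plain Python. This is not GShard's annotation API or the XLA partitioner: the names `shard_bounds`, `spmd_program`, and `run_on_all_devices` are hypothetical, experts are reduced to scalar multipliers, and a plain sum stands in for an all-reduce. The point is only that one program, parametrized by a device id, serves any number of devices.

```python
# Toy illustration of SPMD partitioning (not GShard's or XLA's actual API).
# All names here are hypothetical and chosen for this sketch.

def shard_bounds(size, num_devices, device_id):
    """Return the [start, end) slice of a length-`size` axis owned by one device."""
    assert size % num_devices == 0, "sketch assumes an evenly divisible axis"
    per_device = size // num_devices
    start = device_id * per_device
    return start, start + per_device


def spmd_program(device_id, num_devices, expert_weights, x):
    """The single program every device runs; only `device_id` differs.

    Each device applies its local experts (here: scalar multipliers) to x
    and returns a partial sum.
    """
    start, end = shard_bounds(len(expert_weights), num_devices, device_id)
    local_experts = expert_weights[start:end]
    return sum(w * x for w in local_experts)


def run_on_all_devices(num_devices, expert_weights, x):
    """Simulate launching the same program everywhere, then an all-reduce (sum)."""
    partials = [spmd_program(d, num_devices, expert_weights, x)
                for d in range(num_devices)]
    return sum(partials)


expert_weights = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
single_device = run_on_all_devices(1, expert_weights, 2.0)
four_devices = run_on_all_devices(4, expert_weights, 2.0)
```

Here `single_device` and `four_devices` hold the same value, and the text of `spmd_program` does not change with the device count. That is the intuition behind compilation time staying constant as devices are added: there is one program to compile, not one per device.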
Experimental Results
Research Questions
- RQ1: How can extremely large Transformer models be trained efficiently across thousands of devices without prohibitive compilation or communication overhead?
- RQ2: Can conditional computation via Sparsely-Gated MoE layers provide sublinear compute growth as model capacity increases?
- RQ3: Does an annotation-driven GShard approach simplify model development while enabling automatic, scalable partitioning on XLA?
- RQ4: What are the practical gains in translation quality when scaling to hundreds of billions of parameters in a multilingual setting?
Key Findings
- A 600B-parameter sparsely-gated MoE Transformer trained on 2048 TPU v3 devices for 4 days achieved superior translation quality for 100 languages to English compared to prior art.
- Training cost grew sublinearly with model size: adding parameters through more experts raised capacity faster than it raised compute.
- For comparison, a dense baseline Transformer (2.3B params) required 235.5 TPU v3 core-years, whereas 2048 devices for 4 days comes to roughly 22 device-years for the 600B MoE model, which highlights the efficiency gain from the MoE approach.
- GShard enables automatic partitioning and scales to thousands of devices with O(1) compilation time for the SPMD approach.
- The MoE gating includes expert capacity constraints, an auxiliary loss to balance load, and random routing for second-best experts to utilize capacity effectively.
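The three gating ingredients named in the last finding (expert capacity, an auxiliary balancing loss, and random routing of the second-best expert) can be sketched in plain Python. This is a simplified illustration, not the paper's implementation: the two-pass ordering, the `2 * weight > random` acceptance rule, and the exact normalization of the auxiliary loss are assumptions of this sketch, and `top2_gating` is a hypothetical name.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_gating(token_logits, capacity, rng):
    """Route each token to at most two experts under a per-expert capacity.

    Assumes at least two experts. Returns (assignments, aux_loss), where
    assignments[s] is a list of (expert_index, combine_weight) pairs for
    token s. A token is dropped for an expert that is already full.
    """
    num_tokens = len(token_logits)
    num_experts = len(token_logits[0])
    gates = [softmax(row) for row in token_logits]

    load = [0] * num_experts          # tokens accepted per expert
    top1_count = [0] * num_experts    # top-1 choices per expert (for the loss)
    assignments = [[] for _ in range(num_tokens)]
    second_choice = []

    # Pass 1: first-choice experts fill capacity before any second choice.
    for s, g in enumerate(gates):
        order = sorted(range(num_experts), key=lambda e: g[e], reverse=True)
        e1, e2 = order[0], order[1]
        norm = g[e1] + g[e2]          # renormalize over the two chosen experts
        second_choice.append((e2, g[e2] / norm))
        top1_count[e1] += 1
        if load[e1] < capacity:
            load[e1] += 1
            assignments[s].append((e1, g[e1] / norm))

    # Pass 2: keep the second choice at random, more often when its weight is high.
    for s, (e2, w2) in enumerate(second_choice):
        if load[e2] < capacity and 2.0 * w2 > rng.random():
            load[e2] += 1
            assignments[s].append((e2, w2))

    # Auxiliary loss: mean over experts of (top-1 fraction) * (mean gate value).
    mean_gate = [sum(g[e] for g in gates) / num_tokens for e in range(num_experts)]
    aux_loss = sum((top1_count[e] / num_tokens) * mean_gate[e]
                   for e in range(num_experts)) / num_experts
    return assignments, aux_loss


rng = random.Random(0)
# Four tokens that all prefer expert 0; capacity 2 forces overflow handling.
logits = [[2.0, 1.0, 0.0]] * 4
assignments, aux_loss = top2_gating(logits, capacity=2, rng=rng)
```

In this example only the first two tokens are accepted by expert 0, because its capacity is 2; the other two tokens reach an expert only if their second choice is accepted. The auxiliary term is intended to penalize routing patterns like this one, where a single expert attracts both the top-1 choices and most of the gate mass.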
This explainer was generated by AI and reviewed by a human editor.