QUICK REVIEW

[Paper Review] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee | arXiv (Cornell University) | Jun 30, 2020

Topic Modeling | References: 75 | Cited by: 348

One-Sentence Summary

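GShard introduces a general annotation-based approach, together with an XLA-based SPMD compiler, for training giant Transformer models with Sparsely-Gated MoE layers. It achieves sublinear growth in computation cost and trains a 600B-parameter multilingual translation model in 4 days on 2048 TPUs.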

ABSTRACT

Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

Research Motivation and Goals

Scaling neural networks improves model quality, but it raises practical challenges in computation cost, ease of programming, and efficient implementation on parallel devices. The paper aims to address all three.

Proposed Method

Extend Transformer with Position-wise Sparsely-Gated Mixture-of-Experts (MoE) layers to achieve sublinear computation scaling.
Introduce GShard, a module of lightweight annotation APIs plus an XLA compiler extension for automatic parallelization.
Adopt an SPMD (Single Program Multiple Data) partitioning strategy to keep compilation time constant regardless of device count.
Provide a design where model developers write as if on a single huge device, with automatic partitioning applied by the compiler.
Use a gating mechanism with expert capacity constraints and an auxiliary loss to balance load across thousands of experts.
Demonstrate end-to-end training and scaling on multilingual machine translation from 100 languages to English.
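
The SPMD idea in the list above can be illustrated with a toy simulation: one program is written once and every simulated device runs it on its own shard of the input. This is a minimal sketch in plain Python, not GShard's annotation API or the XLA partitioner. The names `shard`, `per_device_program`, and `run_spmd` are hypothetical, and the "devices" are just loop iterations.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix multiply, standing in for the per-device compute."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def shard(rows: Matrix, num_devices: int) -> List[Matrix]:
    """Split the leading dimension evenly across devices (toy assumption)."""
    assert len(rows) % num_devices == 0, "toy sketch: dimension must divide evenly"
    size = len(rows) // num_devices
    return [rows[i * size:(i + 1) * size] for i in range(num_devices)]


def per_device_program(x_shard: Matrix, w: Matrix) -> Matrix:
    """The single program every device runs; only its input shard differs."""
    return matmul(x_shard, w)


def run_spmd(x: Matrix, w: Matrix, num_devices: int) -> Matrix:
    """Simulate SPMD: same program on each shard, then concatenate the results."""
    outputs = [per_device_program(s, w) for s in shard(x, num_devices)]
    return [row for out in outputs for row in out]


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
# Sharded execution matches the unsharded "single huge device" result.
assert run_spmd(x, w, num_devices=2) == matmul(x, w)
```

Because `per_device_program` is identical for every device, it only needs to be compiled once, which is the intuition behind compilation time that does not grow with the device count.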

Experimental Results

Research Questions

RQ1: How can extremely large Transformer models be trained efficiently across thousands of devices without prohibitive compilation or communication overhead?
RQ2: Can conditional computation via Sparsely-Gated MoE layers provide sublinear compute growth as model capacity increases?
RQ3: Does an annotation-driven GShard approach simplify model development while enabling automatic, scalable partitioning on XLA?
RQ4: What are the practical gains in translation quality when scaling to hundreds of billions of parameters in a multilingual setting?

Key Findings

A 600B-parameter sparsely-gated MoE Transformer trained on 2048 TPU v3 devices for 4 days achieved better quality than prior art when translating from 100 languages to English.
Training cost grew sublinearly with model size: compute increased more slowly than parameter count as capacity was added.
A dense baseline Transformer (2.3B parameters) required 235.5 TPU v3 core-years, whereas 2048 TPU v3 cores running for 4 days amounts to about 22 core-years, highlighting the efficiency gains of the MoE approach.
GShard enables automatic partitioning and scales to thousands of devices with O(1) compilation time for the SPMD approach.
The MoE gating includes expert capacity constraints, an auxiliary loss to balance load, and random routing for second-best experts to utilize capacity effectively.
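
The three gating ingredients named above (expert capacity, an auxiliary balancing loss, and random routing to the second-best expert) can be sketched as follows. This is a simplified illustration in plain Python, not the paper's implementation: `top2_gating` is a hypothetical name, tokens are processed one at a time rather than in vectorized groups, the second expert is kept with probability `2 * g2` as one reading of random routing, and the auxiliary loss is written as the mean over experts of (fraction of first-choice tokens) times (mean gate probability).

```python
import math
import random
from typing import Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_gating(
    logits_per_token: List[List[float]],
    capacity: int,
    rng: random.Random,
) -> Tuple[List[Dict[int, float]], float]:
    """Route each token to at most two experts; return (assignments, aux_loss)."""
    num_tokens = len(logits_per_token)
    num_experts = len(logits_per_token[0])
    load = [0] * num_experts          # tokens accepted by each expert so far
    top1_count = [0] * num_experts    # first-choice counts, used by the aux loss
    gate_sum = [0.0] * num_experts    # running sum of gate probabilities
    assignments: List[Dict[int, float]] = []

    for logits in logits_per_token:
        gates = softmax(logits)
        for e, g in enumerate(gates):
            gate_sum[e] += g
        ranked = sorted(range(num_experts), key=lambda e: gates[e], reverse=True)
        first, second = ranked[0], ranked[1]
        norm = gates[first] + gates[second]
        g1, g2 = gates[first] / norm, gates[second] / norm
        top1_count[first] += 1

        routed: Dict[int, float] = {}
        if load[first] < capacity:    # capacity constraint: overflow is dropped
            load[first] += 1
            routed[first] = g1
        # Random routing (assumed form): keep the second expert w.p. 2 * g2.
        if rng.random() < 2.0 * g2 and load[second] < capacity:
            load[second] += 1
            routed[second] = g2
        assignments.append(routed)

    # Auxiliary loss: larger when first choices and gate mass pile onto few experts.
    aux_loss = sum(
        (top1_count[e] / num_tokens) * (gate_sum[e] / num_tokens)
        for e in range(num_experts)
    ) / num_experts
    return assignments, aux_loss


# Skewed batch: every token prefers expert 0, so capacity caps it at 2 tokens.
skewed = [[2.0, 1.0, 0.0, -1.0]] * 6
assignments, aux_skewed = top2_gating(skewed, capacity=2, rng=random.Random(0))

# Balanced batch: token t prefers expert t, so the auxiliary loss is lower.
balanced = [[1.0 if e == t else 0.0 for e in range(4)] for t in range(4)]
_, aux_balanced = top2_gating(balanced, capacity=2, rng=random.Random(0))
```

In the skewed batch only the first two tokens are accepted by expert 0 and the rest overflow. The balanced batch produces a smaller auxiliary loss, which is the signal that would push the gate toward even load during training.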

This review was generated by AI and reviewed by a human editor.