QUICK REVIEW

[Paper Review] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee|arXiv (Cornell University)|Jun 30, 2020
Topic Modeling | References: 75 | Cited by: 348
One-Sentence Summary

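GShard introduces a general annotation-based approach and an XLA-based SPMD compiler for training giant Transformer models with Sparsely-Gated MoE layers, achieving sublinear growth in compute and training a 600B-parameter multilingual translation model on 2048 TPUs in 4 days.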

ABSTRACT

Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

Motivation and Objectives

  • Motivate the need for scaling neural networks to improve model quality while addressing practical challenges in computation, programming ease, and parallel deployment.

Proposed Method

  • Extend Transformer with Position-wise Sparsely-Gated Mixture-of-Experts (MoE) layers to achieve sublinear computation scaling.
  • Introduce GShard, a module of lightweight annotation APIs plus an XLA compiler extension for automatic parallelization.
  • Adopt an SPMD (Single Program Multiple Data) partitioning strategy to keep compilation time constant regardless of device count.
  • Provide a design where model developers write as if on a single huge device, with automatic partitioning applied by the compiler.
  • Use a gating mechanism with expert capacity constraints and an auxiliary loss to balance load across thousands of experts.
  • Demonstrate end-to-end training and scaling on multilingual machine translation from 100 languages into English.
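
The gating bullet above can be made concrete with a small sketch. The code below is a minimal pure-Python illustration of top-2 routing with a per-expert capacity, a probabilistic dispatch to the second-best expert, and a load-balancing auxiliary term. It is not the paper's implementation: the function names `softmax` and `top2_gating`, the sequential first-come capacity check, the `2 * g2` dispatch probability, and the exact form of the auxiliary loss are assumptions chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top2_gating(logits, capacity, rng=None):
    """Route each token to at most two experts under a per-expert capacity.

    logits: one list of expert scores per token (at least two experts).
    Returns (assignments, aux_loss). assignments[t] holds
    (expert_index, combine_weight) pairs and is empty if the token overflowed.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    num_tokens, num_experts = len(logits), len(logits[0])
    gates = [softmax(row) for row in logits]
    load = [0] * num_experts        # tokens accepted by each expert so far
    top1_count = [0] * num_experts  # how often each expert was the first choice
    assignments = []
    for g in gates:
        order = sorted(range(num_experts), key=lambda e: g[e], reverse=True)
        first, second = order[0], order[1]
        top1_count[first] += 1
        norm = g[first] + g[second]  # renormalize over the two chosen experts
        chosen = []
        if load[first] < capacity:
            load[first] += 1
            chosen.append((first, g[first] / norm))
        # Assumed random routing rule: use the second expert with probability
        # proportional to its normalized weight, and only if it has room.
        if load[second] < capacity and rng.random() < 2 * g[second] / norm:
            load[second] += 1
            chosen.append((second, g[second] / norm))
        assignments.append(chosen)
    # Assumed auxiliary loss: mean over experts of
    # (fraction of tokens whose top choice is e) * (mean gate value for e).
    mean_gate = [sum(g[e] for g in gates) / num_tokens for e in range(num_experts)]
    aux_loss = sum(
        (top1_count[e] / num_tokens) * mean_gate[e] for e in range(num_experts)
    ) / num_experts
    return assignments, aux_loss
```

In this sketch, a batch in which every token prefers the same expert yields a larger auxiliary term than a batch whose preferences are spread evenly, which is the pressure toward balanced load that the bullet describes. Tokens that arrive after an expert is full are simply dropped from that expert.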

Experimental Results

Research Questions

  • RQ1: How can extremely large Transformer models be trained efficiently across thousands of devices without prohibitive compilation or communication overhead?
  • RQ2: Can conditional computation via Sparsely-Gated MoE layers provide sublinear compute growth as model capacity increases?
  • RQ3: Does an annotation-driven GShard approach simplify model development while enabling automatic, scalable partitioning on XLA?
  • RQ4: What are the practical gains in translation quality when scaling to hundreds of billions of parameters in a multilingual setting?

Key Findings

  • A 600B-parameter sparsely-gated MoE Transformer trained on 2048 TPU v3 devices for 4 days achieved superior translation quality for 100 languages to English compared to prior art.
  • Training cost grew sublinearly with model size: compute increased more slowly than parameter count as capacity was added.
  • A dense baseline Transformer (2.3B params) required 235.5 TPU v3 core-years, whereas the 600B MoE run (2048 devices for 4 days) comes to roughly 22 device-years, which is where the efficiency gain of the MoE approach shows.
  • GShard enables automatic partitioning and scales to thousands of devices with O(1) compilation time for the SPMD approach.
  • The MoE gating includes expert capacity constraints, an auxiliary loss to balance load, and random routing for second-best experts to utilize capacity effectively.
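
The gap between parameter growth and compute growth can be illustrated with simple arithmetic. The sketch below counts only one MoE feed-forward sublayer and ignores attention, communication, and everything else in a real training run, so it is not the paper's cost model. The helper `moe_ffn_stats` and the dimensions `d_model=1024` and `d_ff=8192` are hypothetical values picked for illustration.

```python
def moe_ffn_stats(d_model, d_ff, num_experts, top_k=2):
    """Return (total_params, flops_per_token) for one MoE feed-forward sublayer.

    Each expert is assumed to be a two-matrix feed-forward network with
    biases ignored; a multiply-add is counted as 2 FLOPs.
    """
    params_per_expert = 2 * d_model * d_ff
    total_params = num_experts * params_per_expert
    # A token visits only top_k experts, however many experts exist.
    expert_flops = top_k * 2 * params_per_expert
    # The gating projection scores every expert, so it grows with num_experts,
    # but it is small next to the expert computation.
    gating_flops = 2 * d_model * num_experts
    return total_params, expert_flops + gating_flops


for num_experts in (128, 512, 2048):
    params, flops = moe_ffn_stats(d_model=1024, d_ff=8192, num_experts=num_experts)
    print(f"{num_experts:5d} experts: {params / 1e9:6.2f}B params, "
          f"{flops / 1e6:6.1f} MFLOPs per token")
```

Under these assumptions the parameter count scales linearly with the number of experts, while the per-token compute of the sublayer rises only through the small gating term.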


This review was generated by AI and reviewed by human editors.