QUICK REVIEW

[Paper Review] Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

Shaden Smith, Mostofa Patwary|arXiv (Cornell University)|Jan 28, 2022
Topic Modeling|Cited by 299
One-Sentence Summary

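This paper presents MT-NLG 530B, described by its authors as the largest monolithic transformer language model, trained with DeepSpeed and Megatron using 3D (data, tensor, pipeline) parallelism, and details the infrastructure, data curation, training, and evaluation results, including zero-, one-, and few-shot performance and bias.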

ABSTRACT

Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. In this paper, we first focus on the infrastructure as well as the 3D parallelism methodology used to train this model using DeepSpeed and Megatron. Next, we detail the training process, the design of our training corpus, and our data curation techniques, which we believe is a key ingredient to the success of the model. Finally, we discuss various evaluation results, as well as other interesting observations and new properties exhibited by MT-NLG. We demonstrate that MT-NLG achieves superior zero-, one-, and few-shot learning accuracies on several NLP benchmarks and establishes new state-of-the-art results. We believe that our contributions will help further the development of large-scale training infrastructures, large-scale language models, and natural language generations.

Motivation and Objectives

  • Motivate scaling up language models and demonstrate training a 530B-parameter monolithic transformer.
  • Describe the 3D parallelism methodology (data, tensor, pipeline) and its topology-aware mapping for efficient training.
  • Detail the dataset curation, preprocessing, and blending to create high-quality pretraining data.
  • Present training dynamics, hyperparameters, and stability considerations at extreme scale.
  • Report evaluation results across zero-/one-/few-shot settings and discuss observations on biases and generation capabilities.

Proposed Method

  • Adopt 3D parallelism combining data, tensor, and pipeline parallelism with DeepSpeed and Megatron.
  • Utilize topology-aware mapping to optimize inter- and intra-node communication.
  • Pretrain a 530B-parameter decoder-only Transformer with a sequence length of 2048 and a global batch size of 1920 across thousands of GPUs.
  • Curate and preprocess a large, diverse dataset from sources including The Pile and Common Crawl, with deduplication and removal of downstream task data; the resulting corpus holds ≈339B tokens, of which MT-NLG was trained on 270B.
  • Employ 16-bit bfloat16 mixed precision and the Adam optimizer; apply gradient clipping and weight decay, with learning-rate warmup followed by cosine decay.
  • Evaluate with zero-/one-/few-shot prompting on multiple NLP tasks using the lm-evaluation-harness suite.
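The batch arithmetic and the warmup-plus-cosine schedule named in the list above can be sketched as follows. This is a minimal illustration, not the authors' training code: the sequence length, global batch size, and 270B-token budget come from the summary, while `PEAK_LR`, `MIN_LR`, `WARMUP_TOKENS`, and `DECAY_TOKENS` are assumed placeholder values that this summary does not state.

```python
import math

# Values stated in the summary above.
SEQ_LEN = 2048          # tokens per sequence
GLOBAL_BATCH = 1920     # sequences per optimizer step
TRAIN_TOKENS = 270e9    # tokens the model was trained on

# Assumed values for illustration only; the summary does not give them.
PEAK_LR = 5e-5
MIN_LR = 5e-6
WARMUP_TOKENS = 1e9
DECAY_TOKENS = 340e9


def tokens_per_step(global_batch: int = GLOBAL_BATCH, seq_len: int = SEQ_LEN) -> int:
    """Tokens consumed by one optimizer step."""
    return global_batch * seq_len


def steps_for(tokens: float) -> int:
    """Optimizer steps needed to consume `tokens`, rounding up."""
    return math.ceil(tokens / tokens_per_step())


def learning_rate(tokens_seen: float) -> float:
    """Linear warmup, then cosine decay from PEAK_LR down to MIN_LR."""
    if tokens_seen < WARMUP_TOKENS:
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    span = DECAY_TOKENS - WARMUP_TOKENS
    progress = min(1.0, (tokens_seen - WARMUP_TOKENS) / span)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


if __name__ == "__main__":
    print("tokens per step:", tokens_per_step())
    print("steps for the training budget:", steps_for(TRAIN_TOKENS))
    for seen in (0.0, 0.5e9, 1e9, 135e9, 270e9):
        print(f"{seen:.3g} tokens -> lr {learning_rate(seen):.3e}")
```

Indexing the schedule by tokens rather than by step keeps it independent of the batch size, which is convenient if the batch size changes during training.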

Experimental Results

Research Questions

  • RQ1: How can model and training infrastructure be scaled to train a 530B-parameter autoregressive transformer efficiently?
  • RQ2: What data curation and preprocessing strategies are essential for high-quality pretraining at this scale?
  • RQ3: What are the zero-/one-/few-shot capabilities of MT-NLG on standard NLP benchmarks, and how do they compare to prior giant language models?
  • RQ4: What properties (e.g., biases, in-context learning) does MT-NLG exhibit at this scale?
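The zero-, one-, and few-shot settings in RQ3 differ only in how many solved examples are placed in the prompt ahead of the query; the model's weights are not updated. The sketch below illustrates that idea with a hypothetical `Question:`/`Answer:` template and a hypothetical `build_prompt` helper. It is not the prompt format used in the paper or in lm-evaluation-harness.

```python
from typing import List, Tuple


def build_prompt(examples: List[Tuple[str, str]], query: str, k: int) -> str:
    """Build a k-shot prompt: k solved examples followed by the unsolved query.

    k = 0 gives a zero-shot prompt, k = 1 a one-shot prompt, and larger k a
    few-shot prompt. The template is an assumption made for illustration.
    """
    if k < 0 or k > len(examples):
        raise ValueError("k must be between 0 and the number of examples")
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:k]]
    blocks.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    demos = [
        ("What is the capital of France?", "Paris"),
        ("What is the capital of Japan?", "Tokyo"),
    ]
    for k in (0, 1, 2):
        print(f"--- {k}-shot ---")
        print(build_prompt(demos, "What is the capital of Italy?", k))
```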

Key Findings

  • MT-NLG achieves state-of-the-art zero-/one-/few-shot accuracies on several NLP benchmarks, including establishing new SOTA on LAMBADA across all settings.
  • The model demonstrates strong in-context learning and generation capabilities on multiple tasks.
  • 3D parallelism (data, tensor, pipeline) with topology-aware mapping enables efficient training of a 530B parameter model on thousands of GPUs.
  • Careful data curation, filtering, deduplication, and task-data removal are identified as key ingredients for model performance and stability.
  • Validation loss curves show progressive improvement during pretraining, reaching low cross-entropy after 270B tokens.
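One way to picture the 3D parallelism and topology-aware mapping mentioned in the findings above is as a decomposition of each flat GPU rank into (pipeline stage, data replica, tensor shard) coordinates. The sketch below assumes a layout in which tensor parallelism is the innermost dimension, so that a tensor-parallel group occupies consecutive ranks and can sit inside one node when the tensor degree matches the GPUs per node. The function names and the small degrees in the demo are hypothetical and are not taken from the paper.

```python
from typing import List, NamedTuple


class Coord(NamedTuple):
    pipeline_stage: int
    data_replica: int
    tensor_shard: int


def rank_to_coord(rank: int, tensor: int, data: int, pipeline: int) -> Coord:
    """Map a flat rank to 3D coordinates, with tensor parallelism innermost."""
    world_size = tensor * data * pipeline
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world of size {world_size}")
    tensor_shard = rank % tensor
    data_replica = (rank // tensor) % data
    pipeline_stage = rank // (tensor * data)
    return Coord(pipeline_stage, data_replica, tensor_shard)


def tensor_group(rank: int, tensor: int) -> List[int]:
    """Ranks sharing this rank's tensor-parallel group (consecutive ranks)."""
    start = rank - rank % tensor
    return list(range(start, start + tensor))


if __name__ == "__main__":
    # Illustrative degrees only: 2-way tensor, 2-way data, 2-way pipeline.
    T, D, P = 2, 2, 2
    for r in range(T * D * P):
        print(r, rank_to_coord(r, T, D, P), "tensor group:", tensor_group(r, T))
```

Keeping the most communication-heavy dimension on consecutive ranks is the design choice being illustrated here; the actual placement in any given cluster depends on how ranks are assigned to nodes.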


This review was generated by AI and checked by a human editor.