QUICK REVIEW

[Paper Review] Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

Shaden Smith, Mostofa Patwary|arXiv (Cornell University)|Jan 28, 2022

Topic Modeling | Cited by 299

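One-Line Summary

This paper presents MT-NLG 530B, which the authors describe as the largest monolithic transformer language model, trained with 3D (data, tensor, and pipeline) parallelism using DeepSpeed and Megatron. It details the infrastructure, data curation, training, and evaluation results, including zero-, one-, and few-shot performance and bias.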

ABSTRACT

Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. In this paper, we first focus on the infrastructure as well as the 3D parallelism methodology used to train this model using DeepSpeed and Megatron. Next, we detail the training process, the design of our training corpus, and our data curation techniques, which we believe is a key ingredient to the success of the model. Finally, we discuss various evaluation results, as well as other interesting observations and new properties exhibited by MT-NLG. We demonstrate that MT-NLG achieves superior zero-, one-, and few-shot learning accuracies on several NLP benchmarks and establishes new state-of-the-art results. We believe that our contributions will help further the development of large-scale training infrastructures, large-scale language models, and natural language generations.

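Motivation and Objectives

Motivate the scaling of language models and demonstrate the training of a 530B-parameter monolithic transformer.
Explain the 3D parallelism approach (data, tensor, pipeline) and the topology-aware mapping used for efficient training.
Detail the dataset curation, preprocessing, and blending used to build high-quality pretraining data.
Present the training dynamics, hyperparameters, and stability considerations at extreme scale.
Report evaluation results across zero-, one-, and few-shot settings and examine findings on bias and generative capability.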

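Proposed Method

Adopt 3D parallelism, combining data, tensor, and pipeline parallelism using DeepSpeed and Megatron.
Use topology-aware mapping to optimize inter-node and intra-node communication.
Pretrain a 530B-parameter decoder-only transformer with a sequence length of 2048 and a global batch size of 1920 across thousands of GPUs.
Curate and preprocess diverse large-scale datasets drawn from The Pile, Common Crawl, and other sources, with deduplication and removal of downstream task data (the resulting corpus is roughly 339B tokens; MT-NLG was trained on 270B tokens).
Train in mixed precision (16-bit bfloat16) with the Adam optimizer under the paper's stated hyperparameters, applying gradient clipping and weight decay, with learning-rate warmup followed by cosine decay.
Evaluate zero-, one-, and few-shot prompting on multiple NLP tasks using the lm-evaluation-harness suite.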
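
The batch and schedule figures in the list above can be turned into a rough step count and a learning-rate curve. The sketch below is only an illustration: the sequence length, global batch size, and token budget come from this review, while `PEAK_LR`, `MIN_LR`, and `WARMUP_STEPS` are hypothetical values chosen for the example and are not taken from the paper. It also assumes a constant batch size for the whole run, so the step count is an approximation.

```python
import math

# Figures quoted in this review.
SEQ_LEN = 2048          # tokens per sequence
GLOBAL_BATCH = 1920     # sequences per optimizer step
TRAIN_TOKENS = 270e9    # tokens seen during pretraining

# Hypothetical values, used only to make the sketch runnable.
PEAK_LR = 5e-5
MIN_LR = 5e-6
WARMUP_STEPS = 250


def tokens_per_step(global_batch: int = GLOBAL_BATCH, seq_len: int = SEQ_LEN) -> int:
    """Tokens consumed by one optimizer step at a fixed batch size."""
    return global_batch * seq_len


def total_steps(train_tokens: float = TRAIN_TOKENS) -> int:
    """Approximate number of steps needed to consume the token budget."""
    return math.ceil(train_tokens / tokens_per_step())


def learning_rate(step: int, num_steps: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - WARMUP_STEPS) / max(1, num_steps - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


if __name__ == "__main__":
    n = total_steps()
    print(tokens_per_step(), n)
    for s in (0, WARMUP_STEPS, n // 2, n):
        print(s, learning_rate(s, n))
```

Under these assumptions, one step consumes 1920 × 2048 = 3,932,160 tokens, so 270B tokens correspond to roughly 68,665 steps.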

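Experimental Results

Research Questions

RQ1: How can the model and training infrastructure be scaled to efficiently train a 530B-parameter autoregressive transformer?
RQ2: Which data selection and preprocessing strategies are essential for high-quality pretraining at this scale?
RQ3: What are MT-NLG's zero-, one-, and few-shot capabilities on standard NLP benchmarks, and how do they compare with earlier large language models?
RQ4: Which properties (for example, bias and in-context learning) are observed at this scale?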

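Key Findings

MT-NLG achieves state-of-the-art zero-, one-, and few-shot accuracy on several NLP benchmarks and sets a new SOTA on LAMBADA in all three settings.
The model shows strong in-context learning and generation capabilities across multiple tasks.
3D parallelism (data, tensor, pipeline) with topology-aware mapping makes it possible to train a 530B-parameter model efficiently on thousands of GPUs.
Careful data curation, filtering, deduplication, and removal of task data are identified as key factors in model performance and stability.
The validation loss curve improves steadily during pretraining and reaches a low cross-entropy after 270B tokens.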
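
To make the 3D parallelism finding above more concrete, here is a minimal sketch of how a flat GPU rank could be mapped onto a (data, pipeline, tensor) grid so that each tensor-parallel group stays inside one node. This is one simple layout for illustration, not the mapping that DeepSpeed or Megatron actually implement. The function names and the degrees used in the demo (8-way tensor, 35-way pipeline, 8 GPUs per node, 2 data-parallel replicas) are assumptions and are not stated in this review.

```python
from typing import List, NamedTuple


class Coord(NamedTuple):
    data: int    # index of the model replica (data-parallel dimension)
    pipe: int    # pipeline stage within the replica
    tensor: int  # tensor-parallel shard within the stage


def rank_to_coord(rank: int, tensor_size: int, pipe_size: int, data_size: int) -> Coord:
    """Map a flat rank to grid coordinates, with tensor varying fastest."""
    world = tensor_size * pipe_size * data_size
    if not 0 <= rank < world:
        raise ValueError(f"rank {rank} is outside world size {world}")
    tensor = rank % tensor_size
    pipe = (rank // tensor_size) % pipe_size
    data = rank // (tensor_size * pipe_size)
    return Coord(data, pipe, tensor)


def tensor_group(rank: int, tensor_size: int) -> List[int]:
    """Ranks that share this rank's pipeline stage and replica."""
    start = rank - rank % tensor_size
    return list(range(start, start + tensor_size))


def node_of(rank: int, gpus_per_node: int) -> int:
    """Node index, assuming ranks are assigned to nodes contiguously."""
    return rank // gpus_per_node


if __name__ == "__main__":
    T, P, D, GPUS_PER_NODE = 8, 35, 2, 8  # illustrative degrees only
    print("GPUs per replica:", T * P)
    print(rank_to_coord(279, T, P, D))
    # Tensor-parallel peers of rank 13 all sit on the same node.
    print({node_of(r, GPUS_PER_NODE) for r in tensor_group(13, T)})
```

Because the tensor index varies fastest and the tensor degree divides the number of GPUs per node, the most communication-heavy group stays on intra-node links, while pipeline stages and replicas span nodes. That is the general idea behind topology-aware placement, in a much simplified form.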


This review was generated by AI and checked by a human editor.