QUICK REVIEW

[Paper Review] PanGu-$α$: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation

Wei Zeng, Xiaozhe Ren|arXiv (Cornell University)|Apr 26, 2021

Topic Modeling | References: 39 | Citations: 94

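One-line summary

PanGu-α trains Chinese autoregressive language models with up to 200B parameters on 2048 Ascend 910 processors using five-dimensional auto-parallelism, and uses a 1.1TB high-quality Chinese corpus to demonstrate few-shot and zero-shot capability across Chinese NLP tasks.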

ABSTRACT

Large-scale Pretrained Language Models (PLMs) have become the new paradigm for Natural Language Processing (NLP). PLMs with hundreds of billions parameters such as GPT-3 have demonstrated strong performances on natural language understanding and generation with few-shot in-context learning. In this work, we present our practice on training large-scale autoregressive language models named PanGu-$α$, with up to 200 billion parameters. PanGu-$α$ is developed under the MindSpore and trained on a cluster of 2048 Ascend 910 AI processors. The training parallelism strategy is implemented based on MindSpore Auto-parallel, which composes five parallelism dimensions to scale the training task to 2048 processors efficiently, including data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism and rematerialization. To enhance the generalization ability of PanGu-$α$, we collect 1.1TB high-quality Chinese data from a wide range of domains to pretrain the model. We empirically test the generation ability of PanGu-$α$ in various scenarios including text summarization, question answering, dialogue generation, etc. Moreover, we investigate the effect of model scales on the few-shot performances across a broad range of Chinese NLP tasks. The experimental results demonstrate the superior capabilities of PanGu-$α$ in performing various tasks under few-shot or zero-shot settings.

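Motivation and objectives

Motivate scaling Chinese pretrained language models to match and go beyond what English-centric research has explored.
Develop a Transformer-based autoregressive model with an additional query layer for next-token prediction.
Build a high-quality 1.1TB Chinese corpus from diverse sources and preprocess it for pretraining.
Demonstrate scalable distributed training across many devices using MindSpore Auto-parallel.
Evaluate few-shot and zero-shot performance on a variety of Chinese NLP tasks.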

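Proposed method

Use a unidirectional Transformer decoder with an additional query layer that predicts the next token.
Train PanGu-α models with 2.6B, 13B, and 200B parameters on the 1.1TB Chinese corpus.
Apply five-dimensional parallelism (data, op-level model, pipeline model, optimizer model, and rematerialization) via MindSpore Auto-parallel with topology-aware scheduling.
Partition the model and data across 2048 Ascend 910 processors using specific sharding strategies for Q/K/V and the inputs.
Pretrain with a 40k BPE tokenizer and a sequence length of 1024, using cross-entropy as the next-token prediction objective.
Assess data quality with both manual and model-based evaluation, including perplexity as a proxy for data quality.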
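
The next-token cross-entropy objective and the perplexity measure mentioned in this list can be illustrated with a minimal sketch. This is not the authors' MindSpore implementation: the function names, the 4-token toy vocabulary, and the hand-written probabilities are assumptions made for illustration, whereas the models in the review use a 40k vocabulary and sequences of 1024 tokens.

```python
import math

def next_token_cross_entropy(probs, token_ids):
    """Mean negative log-likelihood of each token given its prefix.

    probs[t] is the predicted distribution over the vocabulary for
    position t + 1, given tokens 0..t; token_ids is the full sequence.
    """
    targets = token_ids[1:]  # the token at position t + 1 is the target of step t
    if len(probs) != len(targets):
        raise ValueError("need one distribution per predicted token")
    nll = [-math.log(p[tok]) for p, tok in zip(probs, targets)]
    return sum(nll) / len(nll)

def perplexity(mean_loss):
    """Perplexity is the exponential of the mean cross-entropy in nats."""
    return math.exp(mean_loss)

# Toy example: vocabulary of 4 tokens, sequence of 3 tokens (2 predictions).
tokens = [0, 2, 1]
predicted = [
    [0.1, 0.2, 0.6, 0.1],  # after token 0, the model puts 0.6 on token 2
    [0.1, 0.5, 0.2, 0.2],  # after tokens 0 and 2, it puts 0.5 on token 1
]
loss = next_token_cross_entropy(predicted, tokens)
print(round(loss, 4), round(perplexity(loss), 4))
```

Under this definition, a mean loss of 2.49 nats would correspond to a perplexity of about 12.06. Whether the loss and perplexity figures reported for PanGu-α share this normalization is not stated in the review, so the conversion is only indicative.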

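Experimental results

Research questions

RQ1: How does PanGu-α scale the parameter count and data size of Chinese language models?
RQ2: Can five-dimensional auto-parallelism make training a 200B-parameter model efficient on a large cluster of Ascend 910 processors?
RQ3: How does model size affect perplexity and few-shot/zero-shot performance on Chinese NLP tasks?
RQ4: Which data selection and preprocessing strategies yield high-quality Chinese pretraining data at large scale?
RQ5: What are PanGu-α's generation and few-shot capabilities on summarization, QA, dialogue, and other tasks?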

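Key findings

Validation-set perplexity decreases as model size grows (2.6B: 19.33; 13B: 17.69; 200B: 15.59).
The 200B model's training loss converges to about 2.49, suggesting room for further improvement with more training.
Larger PanGu-α models achieve stronger performance in few-shot/zero-shot settings across a variety of Chinese NLP tasks.
The 1.1TB Chinese corpus is built from 80TB of raw data using rule-based cleaning, model-based filtering, and deduplication.
Five-dimensional parallelism with topology-aware scheduling enables end-to-end training on 2048 Ascend 910 processors.
The authors provide MindSpore's open-source Auto-parallel tooling to support similar large-scale pretraining setups.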
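
The three corpus-construction stages named above (rule-based cleaning, model-based filtering, deduplication) can be sketched as a small pipeline. Everything specific here is a hypothetical stand-in: the minimum-length rule, the CJK-ratio score used in place of a learned quality model, the 0.5 threshold, and exact-match hashing for deduplication are assumptions, since the review does not describe the actual rules or models used.

```python
import hashlib
import re

def rule_based_clean(doc, min_chars=10):
    """Hypothetical rules: collapse whitespace and drop very short documents."""
    text = re.sub(r"\s+", " ", doc).strip()
    return text if len(text) >= min_chars else None

def quality_score(text):
    """Stand-in for a learned quality model: fraction of CJK characters."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(text)

def build_corpus(raw_docs, min_score=0.5):
    """Run cleaning, filtering, and exact-match deduplication in order."""
    seen = set()
    kept = []
    for doc in raw_docs:
        text = rule_based_clean(doc)
        if text is None:
            continue  # removed by the cleaning rules
        if quality_score(text) < min_score:
            continue  # removed by the quality filter
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # removed as an exact duplicate
        seen.add(digest)
        kept.append(text)
    return kept

raw = [
    "盘古是一个大规模的中文预训练语言模型。",
    "盘古是一个大规模的中文预训练语言模型。  \n",  # duplicate after cleaning
    "click here to subscribe now!!!",            # low CJK ratio
    "短文本",                                     # too short
    "昇腾处理器集群用于训练这个模型。",
]
corpus = build_corpus(raw)
print(len(corpus))
```

Ordering the stages from cheapest to most expensive keeps the costlier checks off documents that simple rules already reject, which matters when the raw input is tens of terabytes.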

This review was generated by AI and checked by a human editor.