QUICK REVIEW

[Paper Review] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

Samyam Rajbhandari, Conglong Li|arXiv (Cornell University)|Jan 14, 2022

Domain Adaptation and Few-Shot Learning | Citations: 55

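One-Line Summary

This paper presents DeepSpeed-MoE, which combines the PR-MoE architecture and Mixture-of-Students (MoS) compression with an optimized MoE inference system. For auto-regressive MoE models, it reduces training cost by up to 5x and makes inference substantially faster and cheaper.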

ABSTRACT

As the training of giant dense models hits the boundary on the availability and capability of the hardware resources today, Mixture-of-Experts (MoE) models become one of the most promising model architectures due to their significant training cost reduction compared to a quality-equivalent dense model. Its training cost saving is demonstrated from encoder-decoder models (prior works) to a 5x saving for auto-regressive language models (this work along with parallel explorations). However, due to the much larger model size and unique architecture, how to provide fast MoE model inference remains challenging and unsolved, limiting its practical usage. To tackle this, we present DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the DeepSpeed library, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7x, and a highly optimized inference system that provides 7.3x better latency and cost compared to existing MoE inference solutions. DeepSpeed-MoE offers an unprecedented scale and efficiency to serve massive MoE models with up to 4.5x faster and 9x cheaper inference compared to quality-equivalent dense models. We hope our innovations and systems help open a promising path to new directions in the large model landscape, a shift from dense to sparse MoE models, where training and deploying higher-quality models with fewer resources becomes more widely possible.

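Motivation and Objectives

Extend MoE to auto-regressive NLG tasks, reducing training cost while maintaining quality.
Improve the parameter efficiency of MoE with a new architecture, reducing model size without sacrificing quality.
Develop an end-to-end, highly optimized MoE inference system designed for scalable deployment.
Introduce MoS distillation to compress MoE models and speed up inference further.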

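Proposed Method

Introduce Pyramid-Residual MoE (PR-MoE), which allocates more experts to later layers and uses residual connections for efficiency.
Examine two phenomena: (I) deeper MoE layers benefit more from additional experts; (II) residual and Top-2 configurations can match or exceed a standard MoE, with the residual design doing so at lower communication volume.
Combine Pyramid-MoE and Residual-MoE into PR-MoE for parameter efficiency.
Implement flexible multi-expert and multi-data parallelism in DeepSpeed-MoE so that PR-MoE, whose layers have different numbers of experts, can be trained without breaking load balance.
Develop Mixture-of-Students (MoS): a shallower student that mirrors the teacher PR-MoE is trained via staged knowledge distillation, preserving sparsity.
Propose a KD formulation for training MoS and PR-MoS that retains the sparsity and inference benefits of MoE.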
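
The PR-MoE idea described above can be illustrated with a minimal pure-Python sketch. It is not the DeepSpeed implementation or API: the class `ResidualMoELayer`, the helper `pyramid_expert_counts`, the tiny dimensions, the half-and-half expert schedule, and the choice to scale the expert output by its gate probability are all assumptions made for illustration. Each token passes through a fixed dense MLP and through one gated expert (Top-1), and the two outputs are summed; later layers are given more experts than earlier ones.

```python
import math
import random

def make_linear(n_in, n_out, rng):
    """Random weight matrix stored as a list of n_out rows of length n_in."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def mlp(params, x):
    """Two-layer feed-forward block with ReLU."""
    w1, w2 = params
    hidden = [max(0.0, h) for h in matvec(w1, x)]
    return matvec(w2, hidden)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

class ResidualMoELayer:
    """Hypothetical Residual-MoE block: dense MLP plus one Top-1 gated expert."""

    def __init__(self, d_model, d_ff, num_experts, seed=0):
        rng = random.Random(seed)
        self.dense = (make_linear(d_model, d_ff, rng), make_linear(d_ff, d_model, rng))
        self.experts = [
            (make_linear(d_model, d_ff, rng), make_linear(d_ff, d_model, rng))
            for _ in range(num_experts)
        ]
        self.gate = make_linear(d_model, num_experts, rng)

    def forward(self, x):
        probs = softmax(matvec(self.gate, x))
        best = max(range(len(probs)), key=probs.__getitem__)  # Top-1 routing
        dense_out = mlp(self.dense, x)
        expert_out = mlp(self.experts[best], x)
        # Assumption: the expert output is scaled by its gate probability
        # and added to the dense MLP output as a correction term.
        out = [d + probs[best] * e for d, e in zip(dense_out, expert_out)]
        return out, best

def pyramid_expert_counts(num_moe_layers, early, late):
    """Hypothetical pyramid schedule: fewer experts early, more experts late."""
    half = num_moe_layers // 2
    return [early] * half + [late] * (num_moe_layers - half)

counts = pyramid_expert_counts(4, early=2, late=4)
layers = [ResidualMoELayer(8, 16, n, seed=i) for i, n in enumerate(counts)]

x = [0.5] * 8
chosen_experts = []
for layer in layers:
    y, chosen = layer.forward(x)
    chosen_experts.append(chosen)
    x = [a + b for a, b in zip(x, y)]  # ordinary skip connection around the block

print(counts, chosen_experts)
```

Because only one expert is selected per token, the per-token compute of each block stays at roughly two MLP evaluations regardless of how many experts the layer holds, which is why the expert count can grow in later layers without raising per-token cost.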

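Experimental Results

Research Questions

RQ1: Can MoE be applied effectively to auto-regressive NLG, reducing training cost without sacrificing quality?
RQ2: Can PR-MoE substantially reduce the parameter count relative to standard MoE while maintaining or improving model quality?
RQ3: Can knowledge distillation produce smaller MoE models, such as MoS/PR-MoS, that retain the benefits of MoE and make inference faster?
RQ4: How can a low-latency, cost-effective, end-to-end MoE inference system be designed at scale (hundreds to thousands of GPUs)?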

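Key Findings

MoE models reach lower validation loss than dense models of the same base size and match the quality of larger dense models at lower training cost (for example, 1.3B+MoE-128 matches the quality of a 6.7B dense model).
Training throughput measurements show a 5x cost reduction for an MoE model that reaches the same quality as the larger dense baseline.
PR-MoE reduces the parameter count by up to 3x at accuracy comparable to standard MoE.
MoS distillation, applied on top of PR-MoE, brings the total MoE model size reduction to up to 3.7x while preserving zero-shot performance.
DeepSpeed-MoE inference delivers up to 7.3x lower latency and cost than existing MoE inference solutions, with latency under 25 ms for a trillion-parameter MoE model.
Together, PR-MoE and MoS achieve strong parameter efficiency relative to large standard MoE baselines without a significant loss in quality.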
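
The MoS result above rests on staged knowledge distillation. A minimal sketch of what such a loss could look like follows, under stated assumptions: the student is trained with cross-entropy plus a KL term toward the teacher, and the KL term is switched off after a chosen step. The function names, the `alpha` weight, and the `kd_stop_step` value are hypothetical and are not taken from the paper or from DeepSpeed.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def cross_entropy(student_logits, target):
    """Standard next-token cross-entropy for a single position."""
    return -log_softmax(student_logits)[target]

def kl_teacher_student(teacher_logits, student_logits):
    """KL(teacher || student) over the vocabulary for a single position."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))

def staged_kd_loss(student_logits, teacher_logits, target, step, kd_stop_step, alpha=1.0):
    """Cross-entropy plus a KD term that is dropped once step >= kd_stop_step.

    kd_stop_step and alpha are hypothetical hyperparameters for this sketch.
    """
    ce = cross_entropy(student_logits, target)
    if step >= kd_stop_step:
        return ce  # late stage: train on the language-modeling loss only
    return ce + alpha * kl_teacher_student(teacher_logits, student_logits)

student = [1.0, 0.5, -0.5]
teacher = [2.0, 0.0, -1.0]
early = staged_kd_loss(student, teacher, target=0, step=100, kd_stop_step=1000)
late = staged_kd_loss(student, teacher, target=0, step=2000, kd_stop_step=1000)
print(early, late)
```

Staging the KD term reflects the intuition that a smaller student may not be able to minimize both objectives to the end of training, so the teacher signal is used early and then removed.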

This review was generated by AI and checked by a human editor.