QUICK REVIEW

[Paper Review] DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Damai Dai, Chengqi Deng|arXiv (Cornell University)|Jan 11, 2024

Topic Modeling|Cited by 16

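One-Line Summary

DeepSeekMoE introduces fine-grained expert segmentation and shared expert isolation to obtain highly specialized MoE experts. It improves on the efficiency and performance of conventional MoE architectures, as demonstrated at scales from 2B to 145B parameters.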

ABSTRACT

In the era of large language models, Mixture-of-Experts (MoE) is a promising architecture for managing computational costs when scaling up model parameters. However, conventional MoE architectures like GShard, which activate the top-$K$ out of $N$ experts, face challenges in ensuring expert specialization, i.e. each expert acquires non-overlapping and focused knowledge. In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. It involves two principal strategies: (1) finely segmenting the experts into $mN$ ones and activating $mK$ from them, allowing for a more flexible combination of activated experts; (2) isolating $K_s$ experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts. Starting from a modest scale with 2B parameters, we demonstrate that DeepSeekMoE 2B achieves comparable performance with GShard 2.9B, which has 1.5 times the expert parameters and computation. In addition, DeepSeekMoE 2B nearly approaches the performance of its dense counterpart with the same number of total parameters, which set the upper bound of MoE models. Subsequently, we scale up DeepSeekMoE to 16B parameters and show that it achieves comparable performance with LLaMA2 7B, with only about 40% of computations. Further, our preliminary efforts to scale up DeepSeekMoE to 145B parameters consistently validate its substantial advantages over the GShard architecture, and show its performance comparable with DeepSeek 67B, using only 28.5% (maybe even 18.2%) of computations.
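The abstract's claim that finer segmentation allows "a more flexible combination of activated experts" can be made concrete by counting how many distinct expert subsets a router can choose. The sketch below does this with binomial coefficients. The values N = 16, K = 2, and m = 4 are illustrative assumptions and are not settings stated in this review; the function name is hypothetical.

```python
from math import comb


def num_expert_combinations(n_experts: int, top_k: int, m: int = 1) -> int:
    """Count the distinct sets of activated experts when each of the
    n_experts experts is split into m smaller ones and m * top_k are activated."""
    return comb(m * n_experts, m * top_k)


# Illustrative values only (assumed, not taken from the review).
coarse = num_expert_combinations(16, 2)       # choose 2 of 16 experts
fine = num_expert_combinations(16, 2, m=4)    # choose 8 of 64 smaller experts
print(coarse, fine)
```

Because each expert shrinks by a factor of m while m times as many are activated, the activated parameter count is unchanged, yet the number of reachable combinations grows by several orders of magnitude.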

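Motivation and Objectives

Identify knowledge hybridity and knowledge redundancy in MoE architectures as the motivating problems, and address them.
Propose DeepSeekMoE, which strengthens expert specialization without increasing total parameters or computation.
Demonstrate scaling from 2B to 145B parameters with competitive or superior performance.
Verify that fine-grained segmentation and shared expert isolation reduce redundancy and improve efficiency.
Show the prospects for alignment via supervised fine-tuning in a chat setting, and release the model publicly.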

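Proposed Method

Introduce two core strategies: (i) fine-grained expert segmentation, which splits each expert FFN into $m$ smaller experts along its intermediate hidden dimension and activates $mK$ experts at constant cost; (ii) shared expert isolation, which designates $K_s$ experts as always-active shared experts that consolidate common knowledge.
Formulate the MoE layer with fine-grained routing and shared experts while keeping the total parameter count unchanged.
Incorporate expert-level and device-level load balance losses to mitigate routing collapse and spread computation evenly.
Train 2B-parameter MoE variants on about 100B tokens from a large-scale multilingual corpus and compare against Hash Layer, Switch Transformer, and GShard baselines.
Scale the experiments to 16B and 145B parameters and evaluate against dense models and large-scale MoE baselines.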
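The layer described above (always-active shared experts plus a routed top-k selection) can be sketched for a single token as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: experts are stood in for by hypothetical scaling functions rather than real FFNs, the affinity score is assumed to be a softmax over dot products between the token and a per-expert centroid, and the caller is assumed to pass a `routed_top_k` that already accounts for the shared experts so the total number of active experts stays fixed. All names are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_scaling_expert(factor: float) -> Expert:
    """Hypothetical toy expert; a real expert would be a small FFN."""
    return lambda x: [factor * v for v in x]


def moe_layer(
    u: Vector,
    shared: Sequence[Expert],
    routed: Sequence[Expert],
    centroids: Sequence[Vector],
    routed_top_k: int,
) -> Tuple[Vector, List[int]]:
    """Return the layer output for token u and the indices of the chosen routed experts."""
    # Token-to-expert affinity scores (assumed: softmax over dot products).
    logits = [sum(a * b for a, b in zip(u, c)) for c in centroids]
    scores = softmax(logits)
    # Keep only the routed experts with the highest affinity.
    order = sorted(range(len(routed)), key=lambda i: scores[i], reverse=True)
    chosen = order[:routed_top_k]
    out = list(u)  # residual connection
    for expert in shared:  # shared experts see every token, with no gate
        out = [o + y for o, y in zip(out, expert(u))]
    for i in chosen:  # routed experts are weighted by their affinity score
        out = [o + scores[i] * y for o, y in zip(out, routed[i](u))]
    return out, chosen


token = [1.0, 0.0]
shared_experts = [make_scaling_expert(0.5)]
routed_experts = [make_scaling_expert(f) for f in (1.0, 2.0, 3.0)]
expert_centroids = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
hidden, chosen = moe_layer(
    token, shared_experts, routed_experts, expert_centroids, routed_top_k=2
)
print(chosen, hidden)
```

In this toy setup the token aligns best with the third and second centroids, so those two routed experts are selected while the shared expert contributes regardless of the routing decision.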

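Experimental Results

Research Questions

RQ1: Does fine-grained expert segmentation increase the combinatorial flexibility of routing and improve expert specialization without increasing total parameters or computation?
RQ2: Does isolating a small number of shared experts reduce redundancy and improve the parameter efficiency of MoE models?
RQ3: How does DeepSeekMoE perform against GShard and dense baselines on standard NLP benchmarks at the 2B, 16B, and 145B scales?
RQ4: How do the routing balance losses affect training stability and model performance?
RQ5: Does DeepSeekMoE approach the upper bound of MoE performance, and can it yield a publicly releasable 16B model with practical GPU requirements?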
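For RQ4, it helps to see what a routing balance loss measures. The sketch below assumes a common form of expert-level balance loss: for each routed expert, multiply the normalized fraction of tokens routed to it by its mean affinity score, sum over experts, and scale by a small coefficient. The function name, the argument layout, and the default `alpha` are hypothetical choices for illustration; the review itself does not give the formula.

```python
import math
from typing import List, Sequence


def expert_balance_loss(
    scores: Sequence[Sequence[float]],
    selected: Sequence[Sequence[int]],
    top_k: int,
    alpha: float = 0.01,
) -> float:
    """Assumed expert-level balance loss.

    scores:   per-token affinity scores over the routed experts (T x N).
    selected: per-token indices of the routed experts that were activated.
    top_k:    number of routed experts activated per token.
    alpha:    balance coefficient (hypothetical default).
    """
    n_tokens = len(scores)
    n_experts = len(scores[0])
    counts: List[int] = [0] * n_experts
    for chosen in selected:
        for i in chosen:
            counts[i] += 1
    total = 0.0
    for i in range(n_experts):
        # f_i equals 1.0 when expert i receives exactly its fair share of tokens.
        f_i = n_experts / (top_k * n_tokens) * counts[i]
        # p_i is the mean affinity the router assigns to expert i.
        p_i = sum(row[i] for row in scores) / n_tokens
        total += f_i * p_i
    return alpha * total


# Four tokens, four routed experts, one routed expert per token.
uniform_scores = [[0.25] * 4 for _ in range(4)]
balanced = expert_balance_loss(uniform_scores, [[0], [1], [2], [3]], top_k=1)

skewed_scores = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]
collapsed = expert_balance_loss(skewed_scores, [[0], [0], [0], [0]], top_k=1)
print(balanced, collapsed)
```

With perfectly even routing the sum equals 1, so the loss reduces to `alpha`; when every token collapses onto one expert the loss is larger, which is the signal that pushes the router back toward balance.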

Key Findings

Metric	# Shot	Dense	Hash Layer	Switch	GShard	DeepSeekMoE
Pile (Loss)	N/A	2.060	1.932	1.881	1.808	1.808
HellaSwag (Acc.)	0-shot	54.8	50.5	49.1	54.8	54.8
PIQA (Acc.)	0-shot	72.3	70.6	70.5	72.3	72.3
ARC-easy (Acc.)	0-shot	49.4	43.9	45.9	49.4	49.4
ARC-challenge (Acc.)	0-shot	34.3	31.6	30.2	34.3	34.3
RACE-middle (Acc.)	5-shot	44.0	42.1	43.6	44.0	44.0
RACE-high (Acc.)	5-shot	31.7	30.4	30.9	31.7	31.7
HumanEval (Pass@1)	0-shot	4.9	3.7	2.4	4.9	4.9
MBPP (Pass@1)	3-shot	2.2	0.2	0.4	2.2	2.2
TriviaQA (EM)	5-shot	16.6	10.2	8.9	16.6	16.6
NaturalQuestions (EM)	5-shot	5.7	3.2	2.5	5.7	5.7

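DeepSeekMoE 2B substantially outperforms GShard 2B at the same total parameter count and nearly matches GShard 2.9B while using fewer activated parameters.
DeepSeekMoE 2B nearly approaches the performance of a dense model with the same number of total parameters, suggesting it can get close to the upper bound of MoE performance.
DeepSeekMoE 16B achieves performance comparable to DeepSeek 7B and LLaMA2 7B with about 40% of the computation, and outperforms models with a similar number of activated parameters.
At 145B, DeepSeekMoE shows clear advantages over GShard and performs comparably to DeepSeek 67B using only 28.5% (possibly as little as 18.2%) of the computation.
Ablation studies confirm that both fine-grained segmentation and shared expert isolation contribute to better performance and higher expert specialization.
Analysis suggests that DeepSeekMoE has lower redundancy among routed experts and that shared experts cannot be replaced by routed experts.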

This review was generated by AI and checked by a human editor.